I’ve since left biomedical research (the Genomics team at Novartis Institutes for BioMedical Research - NIBR) and am now doing health analytics with Carrot Health. At Carrot Health, we use Snowflake Computing as our data storage and query system. I love Snowflake, and there are so many features that make it better than what we had at NIBR. In the spirit of “Top 10 Cool Things I Like About Snowflake”, here are my 10 reasons why Snowflake works for biomedical research.
Reason 1 - All in the Cloud (AWS or Azure) - no DBA, no hardware, no tuning.
You can be up and running instantly in your cloud of choice. There are even safety features like UNDROP, should you DROP a table or view by mistake. For bio folks - if you are a small lab and don’t have DBAs and hardware admins, etc., then no problem. And if you do, then the security features (below) should be enough to get your IT department to buy in.
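To make it concrete, here’s a quick sketch of that safety net (the table name is just a placeholder):

```sql
-- Oops: drop a table by accident...
DROP TABLE variant_calls;

-- ...and bring it back, as long as you are still inside the Time Travel retention window
UNDROP TABLE variant_calls;
```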
Reason 2 - Persistent (cached) results (with a visual execution plan)
In Snowflake, if you (or anyone) executes a query, the result set is cached and reused, either in other query plans or as a direct result when the same SQL is run again. That means the first time you execute a query it might need some horsepower, but for 24 hours after that you’ll get that data back instantly. This is great for bio folks, who are typically working on a particular project’s data joined with large public datasets: those results stay cached for a while, and when you move on to the next project they simply age out of the cache. And if you aren’t sure whether a result came from the cache, you can easily go into Query History and look at the visual execution plan.
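A minimal sketch of what that looks like in practice (table and column names are made up):

```sql
-- First run scans the data and uses warehouse compute
SELECT sample_id, COUNT(*) AS n_variants
FROM project_x.public.variant_calls
GROUP BY sample_id;

-- Re-running the identical statement within 24 hours (with unchanged data)
-- returns the cached result set without touching the warehouse
SELECT sample_id, COUNT(*) AS n_variants
FROM project_x.public.variant_calls
GROUP BY sample_id;
```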
Reason 3 - The functions you need - pivot / unpivot, analytic functions and UDFs
Let’s face it - bio folks like their data in matrix form sometimes, and having pivot and unpivot in the database is great. A wide variety of analytic (window) functions helps with basic statistics. And you can add your own user-defined functions too - but JavaScript (ECMAScript) only.
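Here’s a rough sketch of both - a pivot into a sample-by-gene matrix and a small JavaScript UDF (table names, columns, and the gene list are invented for illustration):

```sql
-- expression_long is assumed to have columns (sample_id, gene, expression)
SELECT *
FROM expression_long
  PIVOT (AVG(expression) FOR gene IN ('TP53', 'BRCA1', 'EGFR'))
  AS p (sample_id, tp53, brca1, egfr);

-- A simple JavaScript UDF: GC content of a sequence
CREATE OR REPLACE FUNCTION gc_content(seq STRING)
  RETURNS FLOAT
  LANGUAGE JAVASCRIPT
AS
$$
  // UDF arguments are referenced in upper case inside the JavaScript body
  var gc = (SEQ.match(/[GCgc]/g) || []).length;
  return SEQ.length ? gc / SEQ.length : 0;
$$;
```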
Reason 4 - JSON in SQL
Snowflake supports a VARIANT column type that can hold JSON data, and it has SQL extensions to query that data. This is super useful for mixing structured and semi-structured data together. And that is key for aggregating bio data - we can mostly agree on the structure, but everyone has extra data that they want to keep.
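A small sketch of the idea - a table with structured columns plus a VARIANT column for whatever extra metadata each lab wants to keep (names and JSON are invented):

```sql
CREATE TABLE samples (
  sample_id  STRING,
  collected  DATE,
  extra      VARIANT  -- lab-specific JSON metadata
);

INSERT INTO samples
  SELECT 'S001', '2018-06-01'::DATE,
         PARSE_JSON('{"tissue": "liver", "assay": {"platform": "RNA-seq", "reads": 30000000}}');

-- Query into the JSON with the colon / dot path syntax
SELECT sample_id,
       extra:tissue::STRING          AS tissue,
       extra:assay.platform::STRING  AS platform,
       extra:assay.reads::NUMBER     AS reads
FROM samples;
```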
Reason 5 - Can connect from anything
Snowflake supports ODBC, JDBC, Python, Spark, a web console, and its own snowsql command-line client. You can basically use any tool to get connected. We were able to easily add support for SchemaSpy and Flyway (JDBC-based tools) for Snowflake - and I typically use DbVisualizer (JDBC) to access it.
Reason 6 - Like a data lake but with SQL
Snowflake has amazing capabilities to load and unload data to and from S3 (we are in AWS and now Azure) and it’s fast. You can point it at a folder regularly and it knows which files have already been loaded. And you can define that process to happen automagically. There are some file formats it doesn’t support out of the box - I’ve had to convert some fixed-width data to delimited files - but that is minor compared to the built-in infrastructure. For bio folks, I think this is awesome - getting scientists to put their data in S3 is far easier than helping them get it into the database.
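A sketch of the load path, assuming a bucket of CSV files (bucket, stage, and table names are placeholders):

```sql
-- External stage pointing at an S3 folder
CREATE OR REPLACE STAGE expression_stage
  URL = 's3://my-lab-bucket/expression/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- COPY INTO keeps track of which staged files have already been loaded,
-- so re-running it only picks up new files
COPY INTO expression_long
  FROM @expression_stage
  PATTERN = '.*[.]csv';
```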
Reason 7 - Data Sharing, Data Cloning, and Time Travel
Snowflake has the ability to share databases between accounts. This means that someday we could have reference data loaded (once) in Snowflake and have everyone share it. Or the results of a consortium’s work could be shared once. Sharing WITH SQL access. There is also the ability to quickly clone data, which is another way to share data / parts of data (or promote data from QA to PROD). Snowflake, like Datomic, also has the ability to return results based on the data as of a given time. For bio folks, this is exactly what is needed for reproducible research, or for data that changes over time when you don’t want to deal with formal versioning.
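Here’s a sketch of all three - sharing, cloning, and time travel (account, database, and table names are invented):

```sql
-- Share a reference database with another account (read-only SQL access, no copies)
CREATE SHARE reference_share;
GRANT USAGE ON DATABASE reference_db TO SHARE reference_share;
GRANT USAGE ON SCHEMA reference_db.public TO SHARE reference_share;
GRANT SELECT ON TABLE reference_db.public.genes TO SHARE reference_share;
ALTER SHARE reference_share ADD ACCOUNTS = partner_account;

-- Zero-copy clone, e.g. to promote QA to PROD or hand a project its own snapshot
CREATE DATABASE research_prod CLONE research_qa;

-- Time Travel: query a table as it looked at a point in time...
SELECT COUNT(*)
FROM variant_calls AT (TIMESTAMP => '2018-06-01 00:00:00'::TIMESTAMP_LTZ);

-- ...or as it looked just before a particular statement ran
SELECT COUNT(*)
FROM variant_calls BEFORE (STATEMENT => '<query-id>');
```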
Reason 8 - Multiple Databases and Schemas
Snowflake is one of the few systems that supports multiple databases, each with multiple schemas. And all SQL can cross databases and schemas. This helps tremendously with data organization and potentially with role-based sharing rights. And security doesn’t stop there - data is encrypted at rest and in transit, and you can even lock down access to your own AWS PrivateLink so traffic never leaves your combined data center / AWS cloud. Snowflake is HIPAA and SOC 2 compliant as well.
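For example, a single query can join across databases and schemas with fully qualified names (these names are made up):

```sql
SELECT s.sample_id, g.gene_symbol
FROM project_x.clinical.samples  AS s
JOIN reference_db.ensembl.genes  AS g
  ON s.gene_id = g.gene_id;
```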
Reason 9 - Scaling compute vs scaling storage
With Snowflake, the SQL execution “cluster” is called a “warehouse” (a horrible name, I know, but there you are). One can size (and resize) a warehouse for the queries at hand - thus scaling at will as needed (there are even multi-cluster warehouses to get you more compute if you need it). You pay separately for storage and compute, but you have tremendous control over both (and access to the accounting). You can even have department-only warehouses to enable chargeback policies.
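A sketch of how that looks (warehouse name and sizes are just examples):

```sql
-- A per-team warehouse: you only pay while it is running
CREATE WAREHOUSE genomics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 300    -- seconds of idle time before suspending
  AUTO_RESUME    = TRUE;

-- Scale up for a heavy join, then back down afterwards
ALTER WAREHOUSE genomics_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE genomics_wh SET WAREHOUSE_SIZE = 'XSMALL';
```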
Reason 10 - the bleeding edge is available
Snowflake supports Parquet files as well. It would be awesome to try to use ADAM-formatted data - or heck, run a whole Big Data Genomics variant-calling pipeline directly on the database. Or it could be fun to try a version of hail.is directly on the database. This is something I’d love to see people try - and it is only doable in Snowflake.
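As a starting point for that kind of experiment, you can query staged Parquet files directly before (or instead of) loading them - a sketch, assuming a stage and a named Parquet file format (both names are placeholders):

```sql
CREATE OR REPLACE FILE FORMAT parquet_format TYPE = PARQUET;

-- $1 is the whole Parquet record as a VARIANT; pull fields out with the path syntax
SELECT $1:sample_id::STRING AS sample_id,
       $1:chrom::STRING     AS chrom,
       $1:pos::NUMBER       AS pos
FROM @genomics_stage (FILE_FORMAT => 'parquet_format', PATTERN => '.*[.]parquet');
```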
So there are my 10 - please feel free to comment or email me at brian -dot- repko -at- learnthinkcode -dot- com. I should add that I’m writing this as a way to share my experience with my past co-workers; Snowflake has not asked for this and is not supporting it in any way. My thoughts and opinions only.