Databricks vs Snowflake. The Data Lake / Data Warehouse Battle.

As someone who worked around the classic Data Warehouses back in the day, before S3 took over, when SQL Server and Oracle ruled the day … I love sitting on the sidelines watching new … yet old … battle lines being re-drawn. I could probably scroll back 12 years on StackOverflow and find the same arguments and questions. In one sense Databricks and Snowflake are totally different tools … but are they? Distributed big data processing, applying transforms to data, enabling Data Lakes / Data Warehouses / Analytics at scale. There is a lot of bleed-over between the two; it really comes down to what path you would like to take to get to the same goal.

Databricks vs Snowflake

“Snowflake enables data storage, processing, and analytic solutions that are faster, easier to use, and far more flexible than traditional offerings.”

Snowflake documentation

“One simple platform to unify all your data, analytics and AI workloads”

Databricks website

Hmmm … it does make you wonder, doesn’t it?

Why do Databricks and Snowflake even exist?

This is probably the first question we should ask … why did these two tools come into existence in the first place? I think the basic answer is really … because the traditional relational database couldn’t keep pace with modern and growing data trends.

This has to do with the type of data, and the volume and size of data, being produced and stored by most organizations. Eventually the overhead of running some SQL Server failover cluster just became too much of a pain. Smart people knew there were other ways to get to that pot of gold at the end of the rainbow.

that ol’ Oracle or SQL Server is a beast.

If we look at the quotes from both Snowflake and Databricks about what they are really trying to do, it’s pretty much identical. Data processing. I mean, at least we don’t have to wonder about it.

How do you pick between Snowflake and Databricks?

I think this question is probably less technical than most people would guess.

It really comes down to this: what are you doing with your data today, and what do you want to be able to do with it?

questions, questions, questions.

Of course there are certain things that each of these platforms is better at than the other. Databricks has the power of Spark and things like SparkML behind it. We can nitpick this and that and make a list of certain features or processes that each of the tools either doesn’t do well, or does better than the other. It may be helpful in some circumstances to do that, but I’m guessing that to answer the question, “Should I pick Snowflake or Databricks?” or “What is the difference between Snowflake and Databricks?”, we probably don’t need to go that deep.

Let’s consider what 80%+ of data pipelines and ETL look like today.

  • read CSV, JSON, or flat files
  • transform the data
  • present analytical capabilities on top of that data.

Both Snowflake and Databricks are more than capable of doing each of the above.
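To make that concrete, here is a minimal PySpark sketch of that exact pipeline shape … read, transform, present. Everything in it (bucket, file, columns, table names) is made up for illustration, and on the Snowflake side you’d express the same three steps with a COPY INTO and some SQL.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("typical-pipeline").getOrCreate()

# 1. read csv / json / flat files (hypothetical bucket and file)
orders = spark.read.csv(
    "s3://some-bucket/raw/orders.csv", header=True, inferSchema=True
)

# 2. transform the data
daily_revenue = (
    orders
    .withColumn("order_date", F.to_date("order_timestamp"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# 3. present analytics on top of it -- write out a queryable table
daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")
```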

Questions to ask yourself before picking Snowflake or Databricks.

  • Is my workforce Programmer or Analyst heavy?
    • aka do you have a team of Data Engineers that writes code well, or writes classic SQL? (your team is going to excel at one or the other)
sneaky old coders

If you’re coming from the world of classic RDBMS like Postgres, MySQL, Oracle, MS SQL Server … and that’s how your data is being handled today … this means you would have the easiest transition into Snowflake. Trying to make the switch over to Databricks could make sense, but it requires a whole different set of skills (aka programming).
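To see what that skills gap actually looks like, here is the same made-up transform written two ways. The first will feel like home to a SQL-first team; the second is the programming skill set Databricks leans on. (To be fair, Databricks will happily run the SQL flavor too, via spark.sql.)

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# SQL flavor -- what a classic RDBMS team already knows
# (the orders table and its columns are hypothetical)
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
""")

# DataFrame flavor -- the "programming" way of saying the same thing
top_customers = (
    spark.table("orders")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
    .filter(F.col("total_spent") > 1000)
)
```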

I believe in the 80/20 rule when it comes to deciding between Databricks and Snowflake. Most likely you want a Data Warehouse or Data Lake at the end of the project, and both will give you that. The question is going to be how successful you are at getting the project from start to finish. That is going to depend on the skill set of the people working on the data project.

It’s very rarely about the technical feasibility of whether Snowflake or Databricks can read and transform your data into great analytics. It’s more about what skills your team has that will enable them to do that work.

Teams that have been programming their whole lives will probably pick Databricks … database people who’ve been writing SQL since 1995 will pick Snowflake.

  • How complex are my transformations?
    • what do the data transforms look like? Are there gobs of business logic and unique transformations?
complexity is the root of all evil.

The second area of interest when picking between Databricks and Snowflake is going to be transformation-driven. Can most technical people replicate programming logic in SQL and back again … yes. Should they … no.

It will probably be obvious, but some things are easy in SQL and some things are not. If your transformations are general STRING interpolations and manipulations, SQL will work great. But I’ve also written SQL Server stored procedures back in the day, mixing recursion and nested CASE statements, that would make your head spin.

This is where Databricks pulls ahead of Snowflake. Many complex transformations are just going to be easier and better in Spark with Databricks … why? Because a general-purpose language gives you real functions, control flow, libraries, and unit tests, where SQL gives you nested CASE statements.

I’m not saying Snowflake is incapable of complex transformations, I’m just asking, overall … which one is going to make your life easier?
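As a sketch of what I mean, here is some invented business logic written as plain Python on Databricks. The tiering rules are completely made up … the point is that this reads like code you can unit test, while the SQL equivalent would be a pile of nested CASE statements.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# hypothetical business rules -- easy to read, test, and extend in Python
def customer_tier(total_spent, years_active, has_returns):
    if total_spent is None:
        return "unknown"
    if total_spent > 10000 and years_active >= 5:
        return "gold" if has_returns else "platinum"
    if total_spent > 1000:
        return "silver"
    return "bronze"

tier_udf = F.udf(customer_tier, StringType())

# apply the logic to a (hypothetical) customers table
customers = spark.table("customers").withColumn(
    "tier", tier_udf("total_spent", "years_active", "has_returns")
)
```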

  • What are you going to do with your data later?
    • once your data is in your Data Lake or Data Warehouse, what are you doing with it?

This really isn’t about running “analytical” queries on top of your transformed data. Both Snowflake and Databricks can do that with ease. This probably has a lot to do with Machine Learning in today’s world. It appears Snowflake is trying to move into that world more, but honestly you can’t beat SparkML if you’re looking to go all MLOps on your data.
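For a taste of why, here is roughly what a minimal SparkML pipeline looks like … feature assembly plus a model in one Pipeline, trained right where the data already lives. The table and column names are hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical feature table with a 0/1 "churned" label
training = spark.table("analytics.customer_features")

# assemble raw columns into the single vector column SparkML expects
assembler = VectorAssembler(
    inputCols=["total_spent", "years_active", "order_count"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

# one Pipeline: feature prep + model, fit on the Data Lake itself
model = Pipeline(stages=[assembler, lr]).fit(training)
predictions = model.transform(training)
```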

Also, the ease with which the transformed data can be pushed and pulled to and from other platforms in your specific situation should probably be thought about. Does the output from your Data Lake get pushed to Postgres every night? Does Tableau need to update all its Dashboards based on the most recent information in your Data Warehouse? You should test and think about these things when it comes to picking Snowflake or Databricks.
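That nightly Postgres push, for example, is a short job with Spark’s built-in JDBC writer. The connection details below are placeholders, and you’d need the Postgres JDBC driver available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# push a (hypothetical) warehouse table out to Postgres every night
(
    spark.table("analytics.daily_revenue")
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://reporting-db:5432/reports")
    .option("dbtable", "public.daily_revenue")
    .option("user", "etl_user")
    .option("password", "********")  # pull from a secret store in real life
    .mode("overwrite")
    .save()
)
```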

Data Lakes and Data Warehouses … how do you want them to look?

the answers to your questions lie in the mirror of introspection.

This might seem like a silly question, but it’s very important. People have certain images in their heads when you say Data Warehouse and Data Lake, so you should be clear about what the end result will look like before picking Snowflake or Databricks.

I know some would argue that using Delta Lake with Databricks brings ACID and schema enforcement just like any traditional database would have, but this is not necessarily true.

When building a Data Lake or Data Warehouse, some people are going to value different things over others.

Snowflake is going to excel at giving you data and schema enforcement that Databricks cannot give you. There is no such thing as Primary and Foreign Keys in the Databricks world … but there is in Snowflake. Can you design your system around this? Yes (one common pattern is sketched below). Do you want to? I don’t know, what do you want?
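What does designing around it look like? One common pattern is a Delta Lake MERGE keyed on the would-be primary key, so re-running a load upserts rows instead of duplicating them. A sketch, with hypothetical table names:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forName(spark, "warehouse.customers")
updates = spark.table("staging.customer_updates")

# customer_id plays the role of the primary key Delta won't enforce
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```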

If you want your Data Lake or Data Warehouse to closely resemble what you had in SQL Server and Oracle, with all the data governance and schema enforcement you had before, on steroids … trading away some flexibility and power for that … then Snowflake is for you.

If you know that with a good architecture and data model you can still address most of your data consistency, governance, and schema enforcement concerns … and you want more power and flexibility on top of that data … then pick Databricks … there is very little that Spark and Delta Lake cannot do.

Should I pick Snowflake or Databricks?

I don’t know. You should pick the one that best fits your needs and your ability to create a Data Lake or Data Warehouse that will actually get finished and solve your problems. Both have their strong suits and their weaknesses. You could pick either one and end up at the same place. They both can ingest all sorts of data, transform it, and produce analytics.

It’s more about the journey to get there, what you want it to look like, and the peripheries that surround that Data Lake or Warehouse. Those are most likely what will make or break the project.