Data Lake vs Data Warehouse – What’s the Dealio?

Data Lake, Data Warehouse, Lake House, Data Mart, it’s always something, isn’t it? Don’t get me started on Data Mesh. Yikes, it’s hard to keep up these days. I want to explore the Data Lake vs the Data Warehouse and what it all really boils down to: what is the real difference? Is it data modeling, architecture, storage? I think there are a few different things that differentiate a Data Lake from a true Data Warehouse. Let’s talk.

Data Lake vs Data Warehouse

The Data Warehouse was the behemoth for decades, the unchallenged and always-evolving single source of truth … kinda. It spawned a generation of database engineers, ETL developers, database administrators, and data analysts. It’s quite amazing if you think about it. Its pinnacle was probably The Data Warehouse Toolkit by Kimball.

But, this too shall pass in a blaze of glory. Kinda.

What killed the classic Data Warehouse?

I know there are still a number of SQL Server and Oracle Data Warehouses floating around out there; they will never go away. But it’s quite apparent that those days have slowly faded into the background. I think there were a few factors that killed the classic Data Warehouse and replaced it with the Data Lake.

  • explosion in the size of data
  • explosion in the velocity of data
  • stagnation of ETL tools
  • limitations of relational databases
  • growth in the types of data
  • administration overhead
  • Hadoop/Spark
  • cloud storage

I’m sure there are probably more; I just can’t think of them.

Type and Volume of Data

This is one of the major reasons the classic Data Warehouse died and the Data Lake took over. With the rise of streaming and messaging infrastructure, it seemed like the Data Warehouse didn’t have an answer for JSON and other similar types of data. Sure, I know you can store JSON in SQL Server …. Lord help the developer having to do all the parsing. It was always a little awkward. It just didn’t feel right.
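For contrast, here is how the lake side of the fence treats that same JSON. A minimal PySpark sketch; the bucket path and field names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-example").getOrCreate()

# Spark infers the nested schema straight from the files, no manual parsing required.
events = spark.read.json("s3://my-bucket/raw/events/")  # hypothetical path
events.printSchema()

# Nested fields are addressable with plain dot notation (hypothetical field names).
events.select("user.id", "event_type").show(5)
```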

The volume of data started to grow out of hand as well; it was harder and harder for a single-instance RDBMS to keep pace. Splitting data files across different drives only gets you so far. And the companies tried, oh they tried … which led to the next problem.

Administration Overhead and Architecture Limits

I think this was the second major death knell for the classic Data Warehouse. Let’s be honest, the complexity and trouble of setting up multi-node, sharded, and fail-over clusters required at least two full-time people just to handle. And those solutions were, and are, not cheap.

And even with all that overhead, query speed still suffered, along with other database problems.

Then along came the juggernauts, who smelled blood in the water.

Spark + Hadoop or Cloud Storage

Hadoop has been around for a long time and was used for Big Data, but it was never that mainstream and could never really replace the classic Data Warehouse. Why?

Hadoop by itself plus Java = One Big Headache. Even with things like Hive and Pig trying to offer SQL-like antics, it just wasn’t the same.

Spark solved that problem, and then came Cloud Storage. Together they were the answer.

Spark gave you the power of ANSI SQL (the thing people really use an RDBMS for) and Cloud Storage gave you effectively infinite, reliable storage capacity. Add in the distributed nature of Spark and it’s just too good to pass up.
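To make that concrete, here is a minimal PySpark sketch of the whole idea: Parquet files sitting in cloud storage, queried with plain ANSI SQL. The bucket path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql").getOrCreate()

# Point Spark at Parquet files sitting in cloud storage, no database server required.
orders = spark.read.parquet("s3://my-bucket/lake/orders/")  # hypothetical path
orders.createOrReplaceTempView("orders")

# Plain ANSI SQL, executed in a distributed manner across the cluster.
spark.sql("""
    SELECT customer_id, SUM(order_total) AS lifetime_value
    FROM orders
    GROUP BY customer_id
    ORDER BY lifetime_value DESC
""").show(10)
```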

What makes a Data Lake different than the Data Warehouse?

After all that nonsense, what really is the difference between a Data Lake and a Data Warehouse? I really think it comes down to two things.

  • Data Model
  • Technology

Data Model

One of the biggest differences between the classic Data Warehouse and the new Data Lakes is the data model. The differences are slight, but noticeable to anyone who has spent a decade or two working inside an RDBMS Kimball warehouse.

The Data Lake model is just a little “looser” than the RDBMS Data Warehouse.

Why?

Some of it has to do with the technology and the realization that there is a trade-off between constraints and the end result.

Typically a Data Lake will have fewer “constraints” than most people are used to with a Data Warehouse.

Data Lakes usually don’t support foreign keys, for example. They can exist … they just won’t be enforced. It’s up to the developers to be more rigorous and enforce such concepts in the code … not rely on something or someone else to do that.
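Here’s what that looks like in practice: a sketch of enforcing a “foreign key” by hand with a left anti-join in PySpark. The table paths and the customer_key column are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fk-check").getOrCreate()

# Hypothetical fact and dimension tables stored as Parquet in cloud storage.
facts = spark.read.parquet("s3://my-bucket/lake/fact_sales/")
dim_customer = spark.read.parquet("s3://my-bucket/lake/dim_customer/")

# left_anti keeps only the fact rows whose customer_key has no match in the
# dimension, which is exactly what a foreign key constraint would have caught.
orphans = facts.join(dim_customer, "customer_key", "left_anti")

orphan_count = orphans.count()
if orphan_count > 0:
    raise ValueError(f"{orphan_count} fact rows reference a missing customer_key")
```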

Also, forcing data all the way into third normal form, with absolutely no duplication, is less common.

It’s totally acceptable to have Fact and Dimension style data models in Data Lakes; in fact, it should be encouraged. It’s a great way to deal with and think about data, it makes things approachable and well set up for queries. But it would be strange to find a Data Lake with as many dimensions as a classic RDBMS Data Warehouse.

Distributed systems like Spark can do broadcast joins across the cluster, but you should probably not design a Lake data model that needs 15 different dimension joins to get the answer.
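For what it’s worth, the broadcast hint itself is one line in PySpark. A sketch, assuming hypothetical paths and a date_key join column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

facts = spark.read.parquet("s3://my-bucket/lake/fact_sales/")  # hypothetical paths
dim_date = spark.read.parquet("s3://my-bucket/lake/dim_date/")

# The hint ships the small dimension to every executor instead of shuffling
# the huge fact table across the network.
result = facts.join(broadcast(dim_date), "date_key")
result.show(5)
```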

Think about it like this … the classic Data Warehouse inside an RDBMS is going to have all sorts of keys and tables and constraints. The Data Lake is probably going to have fewer “tables” and not as many constraints. They will both have Facts and Dimensions, with the Data Lake being more flexible and probably less complex in its data model, while the classic Data Warehouse will envelop every aspect of Kimball and dimensional modeling.

Technology

The other obvious difference between a Data Warehouse and a Data Lake is probably going to be the technology used.

Pretty much anything inside a relational database like SQL Server, Oracle, Postgres, or MySQL is going to be best suited for the traditional Data Warehouse. The new Data Lake, though, is a different animal.

Primarily you will find Cloud Storage like AWS S3 or Google’s GCS used as the backend storage system. Files will probably be either Parquet, ORC, or Avro … maybe flat files if someone doesn’t know what they are doing.

You will see things like Delta Lake used as an abstraction layer on top of the files. Although not necessary, it can make the data modeling, and the look and feel, more like a classic database.
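A quick sketch of what that looks like, assuming a Spark session configured with the delta-spark package and made-up paths:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed and configured on the session.
spark = SparkSession.builder.appName("delta-example").getOrCreate()

df = spark.read.parquet("s3://my-bucket/lake/raw_customers/")  # hypothetical path

# Writing in Delta format layers ACID transactions and schema enforcement
# over what are otherwise just Parquet files in a bucket.
df.write.format("delta").mode("overwrite").save("s3://my-bucket/lake/customers")

# Register it as a table and the lake starts to look and feel like a database.
spark.sql("CREATE TABLE customers USING DELTA LOCATION 's3://my-bucket/lake/customers'")
```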

Also, instead of just plain old ANSI SQL on top of a relational database, the flexibility of tools like Spark, Druid, and the like will be used to do the heavy queries and analytics. These tools are powerful and flexible; they allow you to be not-so-perfect when it comes to Kimball-style Fact and Dimension modeling and still get the same results, often faster.

And don’t forget that with managed Spark and Cloud Storage offered by Databricks, AWS, GCP, or Azure, the overhead of running and managing such massive distributed systems is nothing compared to managing SQL Server Failover Clusters.

Musings

Generally speaking, there can be some grey area and overlap between Data Warehouses and Data Lakes, but usually the difference becomes obvious when you see it.

The classic Data Warehouse will live inside a relational database, and it will probably follow Kimball-style data modeling down to the nth degree. It’s usually a source of pride. The Data Lake will probably consist of files stored in the cloud, loosely following a data model, with the heavy lifting happening inside code, like Spark.

I’ve enjoyed the change of pace. The Data Lake is flexible, and tools like Databricks are great; they allow you to move complexity in and out of the data model as needed, they’re quicker to deliver results, and they’re less obsessed with achieving perfect third normal form.

Data Lakes are the future.