, , ,

5 Part Series – Demystifying Data Warehouses / Data Lakes / Lake Houses

Even I get confused these days. Data Warehouse, Data Lake, and Lake Houses … why do we have three, what are the differences? Is it all just marketing huff-a-luff? Technology and life in the data world seem to be changing fast these days. Lot’s of new vendors on the streets trying to hawk their tools and solutions, each of them pumping out content designed to solve all your data needs.

I’ve seen a lot of content out there by SAAS vendors, and by folks who ascribe to a said vendor, about Data Lakes and Lake Houses, new schema designs and approaches, and it’s hard to know what is just a sales tactic and what is real. I’m going to stir the pot.

This is a start of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. Enjoy.

Part 1 – What are Data Warehouses, Data Lakes, and Lake Houses?

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Part 4 – Keys To Sucess – Idemptoency and Partitioning.

Part 5 – Serving Data from your Data Warehouse, Data Lake, or Lake House.

Part 1 – What are Data Warehouses, Data Lakes, and Lake Houses?

You know, I would like to think I’ve learned a few things in my life, older and wiser maybe? My wife would beg to differ. Either way there is something that I’ve always used in my data career to help me seperate thoughts, technology, and and ideas. It doesn’t work all the time, but it’s a good starter, it helps me how to understand and how seriously to take something.

“Explain this concept to me in a single sentence.”

– me

We should try to apply this the concept of Data Warehouses, Data Lakes, and Lake Houses. Although I think this might harder than it sounds.

Definitions

Let’s just see what the internet says the the definitons are …

Data Warehouse

“… a copy of transaction data specifically structured for query and analysis”

– Kimball

Data Lake

” … is a centralized repository that allows you to store all your structured and unstructured data at any scale.”

– AWS

Lake House

“… new, open architecture that combines the best elements of data lakes and data warehouses.”

– Databricks

Well, they seem to make less-sense as you go down the list. I’m going to make a few statement of my own here that I believe can summarize the difference between the Data Warehouse, Data Lake, and Lake House. It seems about as clear as the picture below … what is the real difference?

The Real Definitions

Here we go.

  • What people usually mean by Data Warehouse / Data Lake / Lake House is the technology stack they are using.
  • The order of existance is/was … Data Warehouse -> Data Lake -> Lake House.
  • The concepts that drive Data Warehousing also drive Data Lakes and Lake Houses.
  • Data modeling for Data Warehousing is well defined, not so for Data Lakes or Lake Houses.
  • Data Warehouses were blamed for not being able to serve new customers like Machine Learning, and gave rise to Data Lakes.

I have found it helpful to look at the Data Warehouse / Data Lake / Lake House dicussion like a continuum. There are probably a few old school systems that fall squarely into a single category, most likely you might find yourself “in-between” somewhere.

Data Warehouses

Everyone is pretty familiar with Data Warehouses at this point. What is the most common feature about a Data Warehouse.

  • Kimball stype Fact-Dimension tables.
  • Strong Data Normilization.
  • Used to built in a Relational Database System (RDBMS).
  • Built purely for analytics and rollups.

This tricky part is that the lines have gotten blured between a Data Warehouse and a Data Lake the last few years. Folks started shifting towards Snowflake, BigQuery, and Redshift. Many people kept the same dimensional model, but you started to see cracks in the armour.

The downfall of the classic Data Warehouse stored in SQL Server, Oracle, and Postgres etc was the combination of a technology failure and modeling failure.

The Scale and Data Problem.

One of the biggest problems in the last ten years was the growth in the data and the type of data being forced into the Data Warehouse. Even with massive SQL Server failover clusters, the classic Kimball normalized data model just started to have problems with 10’s of TB’s of data and the types of data having to be forced to fit into a relational database.

This is how the Data Lake came to save the day.

Data Lakes

Probably one of the biggest differences that happened between the transition from the Data Warehouse to the Data Lake was the movement of the underlaying technology, which in turn drove other changes. The original Data Warehouses on a relational database couldn’t handle the volume of data, nor could it “easily” capture more un-relational type of data that wasn’t transactional in nature.

  • move from databases to file storage
  • move to big data systems like Spark, Presto, etc.
  • lossening of strictly designed data models
  • support more than analytics … aka ML etc.

Data Lakes became and probably still are in many cases, a contiunation of the Data Warehouse with the addition of new storage (think s3), new tools (think Spark), that can still have fact and dimensions, but also can just have many structure and semi-structure data sinks for other types of data … data sinks they may be partially transformed or not transformed at all.

To me a Data Lake is just like a conglomeration of what can be normal Fact and Dimensions, other un-touched raw data sinks, along with many semi transformed data points, the power being that all the data is sitting in the same platform and can be combined, viewed, and analyized with the same tools … giving birth to a Data Lake.

By definition you probably cannot have a Data Lake inside a relational database.

Data Lakes provide not only analytics that would come from a classic Data Warehouse, but they can also provide features stores for Machine Learning and many other uses. Things a traditional Data Warehouse just cannot provide.

Lake House

Not going to lie, I saved the worse for last. I have to be honest here, I’m a little skeptical to the idea of a Lake House being any different from a Data Lake, but I’m open to change. I persionally feel like the Lake House appears to be mostly marketing hype, but maybe history will prove me to be a liar.

One of the reasons a Lake House is hard to define, is because the answer you get is quite ambigious, and sounds percicesly what many Data Lakes are already doing. It’s hard to know when you have a Data Lake and when you have a Lake House.

“… what you would get if you had to redesign data warehouses in the modern world.”
“… similar data structures and data management features to those in a data warehouse directly on top of low cost cloud storage.”

– Databricks

” … Enter the cloud data lakehouse, a unified approach for all your data use cases from BI to AI.”

– Dremio

“… Data lakehouses implement data warehouses’ data structures and management features for data lakes.”

– Snowflake

I think the key thing here is that if you google “Lake House” and try to figure out what it really is … you will end up on a marketing blog post from many different compaines. That should tell you something.

The Data Lake seems to offer a combination of a Data Warehouse and a Data Lake, but since most Data Lakes were already being stored in the cloud, as parquet or something else, it isn’t really clear what a Data Lake is other then some new “technology.” You will hear about different layers of data … like Gold, Bronze, Silver, this thought methodology has started to creep around thanks to Databricks, but that being said what really does define a Lake House?

I hope this topic of Lake Houses becomes more clear and defined in the future.

Summary

What do you really need to know as a Data Engineer or anyone else about the definition of a Data Warehouse, Data Lake, or Lake House.

I think it comes down to the data model and the technology being used.

If you’re still storing all your data in SQL Server or Postgres, and live and die by the Kimball style models into which all data must be forced … then you have a Data Warehouse.

If your storing all your data is some cloud storage like s3 with some semblance of a data model that has “facts” or aggregates, as well as other semi-transformed data, you have a Data Lake.

If the rep for your SAAS company tells you you have a Lake House, then you have one.