This is a start of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. In Part 2 We are digging into the common Big Data tools and how those technologies have a direct impact on Data Models and what kind of Datastore ends up being designed.

Part 1 – What are Data Warehouses, Data Lakes, and Lake Houses?

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Part 4 – Keys To Sucess – Idemptoency and Partitioning.

Part 5 – Serving Data from your Data Warehouse, Data Lake, or Lake House.

Read more

Even I get confused these days. Data Warehouse, Data Lake, and Lake Houses … why do we have three, what are the differences? Is it all just marketing huff-a-luff? Technology and life in the data world seem to be changing fast these days. Lot’s of new vendors on the streets trying to hawk their tools and solutions, each of them pumping out content designed to solve all your data needs.

I’ve seen a lot of content out there by SAAS vendors, and by folks who ascribe to a said vendor, about Data Lakes and Lake Houses, new schema designs and approaches, and it’s hard to know what is just a sales tactic and what is real. I’m going to stir the pot.

This is a start of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. Enjoy.

Read more

Hive is like the zombie apocalypse of the Big Data world, it can’t be killed, it keeps coming back. More specifically the lesser-known Hive Metastore is the little sneaker that has wormed its way into a lot of Big Data tooling and platforms, in a quasi behind the scenes way. Many people don’t realize it, but Hive Metastore is the beating heart behind many systems, including Databricks. It’s one of those topics that sneaks up on you, ignore it happily at your own peril, till all of a sudden you need to know everything about it.

Specifically, I want to talk about Hive MetaStore as related to Databricks, how it works inside the Databricks platform, and what you need to know. I tripped myself up a lot during my initial forays into Databricks at a Production level. When you wander outside the realm of Notebooks, which you should, strange things start to happen. Databricks seems to assume you already have your own Hive Metastore, maybe like the Glue Data Catalog, or that you want to set up your own somewhere. But what if you don’t?

Read more

Something happens with you starting working with 10’s of billions of records and data sets that are hundreds of TBs in size. Do you know what happens? Things stop working, that’s what. I miss the days where 1-10 TBs were considered large and in charge. the good ole days.

I want to talk about lessons learned from working with MERGE INTO using Databricks Sparks. The suggestions, the marketing material, the internet, and what you actually need to do to gain reasonable performance. It’s easy to say … “here … use this new feature, you will get % 50-speed improvements.” Yeah right. Honestly, new features and fancy tricks always help, but typically it comes down to the fundamentals. The “boring” stuff if you will, that make or break Big Data operations.

Read more

What to choose what to choose? The age-old problem that has plagued data engineers forever, ok maybe like 10 years, should you use CTE’s or Sub-Queries when writing your SQL code. This has become even more of a relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. 15 years ago working on SQL Server I would ask myself the same question.

Are they really that different at all? Is it just a matter of preference? Let’s take a look at a few examples of CTE vs Subquery using SparkSQL as an example and see what we see.

Read more

Databricks, easily the hotest tool these days for Data Lakes and Data Warehousing, it’s a beast. As with any new technology there are always growing pains, learnings, and tips and tricks that might not be obvious to those dipping their toes in the water. Not understand certain concepts, and being unware of specific configurations can cost you time and money very easily when running large ETL pipelines on Databricks.

I want to share 7 tips for Databricks newbies, and oldies, that are foundational to good Data Engineering architecture, affecting both performance and cost.

Read more

Data Modeling is a topic that never goes away. Sometimes I do reminisce about the good ol’ days of Kimball-style data models, it was so simple, straightforward, just the same thing for years. Then Big Data happened, Spark happened. Things just changed. There is a lot of new content coming out around Data Lakes and data modeling, but it still seems like a fluid topic, with nothing as concrete as the classic Data Warehouse toolkit.

Oh, what to do what to do. I do believe there are a few key ideas and points to being successful with file-based Data Lake modeling. I think it’s a mistake to fully embrace the classic Kimball-style Data Warehouse approach. It really comes down to Relational Database SQL vs File-Based data models are going to be different, for technical and practical reasons.

Read more

I saw a recent post on r/datengineering, a question centered around why Databricks is so popular when tools like EMR have been floating around for so long. It got me thinking about it. It really isn’t all about the technical side and offerings, although that does play a large role. There are always proponents for every technology, old or new … like our favorite band or sports team, fight to the death for what we love and cherish. I want to talk theoretically, and technically about Databricks and EMR, and why you should use Databricks. 🙂

Read more

Data Lake, Data Warehouse, Lake House, Data Mart, it’s always something isn’t it? Don’t get me started on Data Mesh. Yikes, it’s hard to keep up these days. I want to explore the Data Lake vs the Data Warehouse and what it really all boils down to, what is the real difference. Is it data modeling, architecture, storage? I think their are a few different things that differentiate a Data Lake from a true Data Warehouse, let’s talk.

Read more

As someone who worked around the classic Data Warehouses back in the day, before s3 took over and SQL Server and Oracle ruled the day … I love sitting on the sidelines watching new … yet old battle-lines being re-drawn. I could probably scroll back in StackOverflow 12 years and find the same arguments and questions. In one sense Databricks and Snowflake are totally different tools … but are they? Distributed big data processing, apply transforms to data, enable Data Lake / Data Warehouse / Analytics at scale. There is a lot of bleed over between the two, it really comes down to what path you would like to take to get to the same goal.

Read more