Data Warehousing Archives - Page 2 of 5 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Data Warehousing, Rust

DataFusion courtesy of Rust, vs Spark. Performance and other thoughts.

I think it’s funny that DataFrames are so popular these days, I mean for good reason. They are a wonderful and intuitive way to work with and on datasets. Pandas … the nemesis of all Data Engineers and the lover of Data Scientists. Apache Spark is really the beast that brought DataFrames to the masses. Even those little buggers over at Apache Beam give you DataFrames.

Of course, when anything gets popular, you start getting little things that start to pick and peck at the heels. I would probably say that is what DataFusion with Rust seems to be. Seems more like a contender against Pandas rather than Spark to me. I guess if you’re just using Spark locally or on a single node, sure you could consider using DataFusion. Code available on GitHub.

September 11, 2022

Data, Data Engineering, Data Warehousing

5 things I wish I knew about Databricks … before I started.

How many times in your life, that is but a mist, have you thought, “If I had only known that in the beginning?” I feel as if I’ve committed that cardinal sin as a developer and Data Engineer … falling in love with a tool to the exclusion of all else. I mean truly, Databricks has brought Big Data to the masses, all you need is your laptop and 10 minutes of PySpark training before your spending gobs of money, processing massive amounts of data. Where else, and with what else can you do such things? Try it with EMR, good luck to you.

That being said, when you love something you start to notice the slight imperfections and problems with that something. You get kinda nit-picky. Such is life. I want to save some poor soul out there some heartache, that moment when you’ve been writing code for hours or days, and come upon a little surprise that makes your heart drop into your shoes, and the blood runs to your face. Here are 10 things I wish I knew about Databricks before I started. Maybe it will save you time, help you, who knows.

August 5, 2022

Big Data, Data, Data Engineering, Data Warehousing

Exploring Delta Lake’s ZORDER, and Performance. On Databricks.

I think Delta Lake is here to stay. With the recent news that Databricks is open-sourcing the full feature-set of Delta Lake, instead of keeping the best stuff for themselves, it probably has the most potential to be the number one go-to for the future of Data Lakes, especially within those organizations that are heavy Spark users.

One of the best parts about Delta Lake is that it’s easy to use, yet it has a rich feature set, making it a powerful option for Big Data storage and modeling. One of those features that promise a lot of performance benefits is something called ZORDER. Today I want to explore more in-depth what ZORDER is, when to use it, when not to use it, and most importantly test its performance during a number of common Spark operations.

July 7, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

It still seems like the wild west of Data Quality these days. Tools like Apache Deque are just too much for most folks, and Data Quality is still new enough to the scene as a serious thought topic that most tools haven’t matured that much, and companies dropping money on some tool is still a little suspect. I’ve probably heard more about Great Expectations as a DQ tool than most.

With the popularity of PySpark as a Big Data tool, and Great Expectations coming into its own, I’ve been meaning to dive into what it would actually look like to to use Great Expectations at scale and answer some simple questions. How easy is it to get up and running with Spark, what’s the path of least resistance to getting some basic Data Quality checks in place in a data pipeline.

June 2, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

Reducing Complexity with Databricks + Delta Lake COPY INTO

As the years drag by in Data Engineering, there are a few things that I have come to appreciate more and more. One of those topics that is close to number one on the list is complexity reduction. Today’s modern data stacks are filled to the brim with technologies and tools, full to the brim, and overflowing. So many tools with such wonderful features, sometimes all the magic comes with a downside. Complexity. Complexity can turn something wonderful into a nightmare.

Reducing (not avoiding) complexity seems to be one of the main tenets I work on these days when designing resilient, reliable, and repeatable data pipelines that can process terabytes of data. One of those tools is COPY INTO feature of Databricks + Delta Lake.

April 14, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Data Quality – Great Expectations for Data Engineers

Mmmm … Data Quality … it is a thing these days. I look forlornly back to the ancient days of SQL Server when nobody cared about such things. Alas, we live in a different world, where hundreds of terabytes of data are the norm, and Data Quality becomes a thing. I’ve been meaning to give Great Expectations a poke for like a year, but just haven’t had the time or inclination to do so, but times are changing, and so should I.

I’m not really planning on giving an in-depth guide to Data Quality with Great Expectations, what I’m more interested in are topics like, how easy is it to set up and use, what’s the overhead, what are the main features and concepts and are they easy to understand. I find this sort of review of Data Engineering tools to be more helpful than simply a regurgitation of the documentation.

March 17, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 4 – Keys To Success – Idempotency and Partitioning.

As the road winds on we come to Part 4, of our 5 Part Series on Data Warehouses, Lakes, and Lake Houses. Finally, we are getting to some fun topics after all the boring stuff. Today I want to talk about the two keys to success in your Data Lakes … Idempotency and Partitioning. I firmly believe these two concepts are the cornerstones of the new exciting, or not-so-exciting world of Data Lakes and Lake Houses, without which your data and pipelines go the way of the dodo.

February 9, 2022

Big Data, Data, Data Engineering, Data Warehousing

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL.

Is there any problem more classic to the Data Lakes and Data Warehouses than duplicate records? You would think after doing the same ETL for over a decade I could avoid the issue, apparently not. It’s good never to think too highly of one’s self, the duplicates can get us all. Today I want to talk about a wonderful feature of Databricks + Delta Lake MERGE statements that are perfect for quietly and insidiously injecting duplicates into your Data Warehouse or Data Lake. This is a great trick to play on your unsuspecting coworkers.

February 3, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

The testing never ends. Tests tests tests, and more tests. When it comes to data engineering and data pipelines it seems good practices are finally catching up after years. In the past, the data engineering community took a lot of heat, and rightly so, for not adopting good software engineering principles, especially in data pipelines.

In the defense of many data engineers, because of the varied backgrounds people come from, some were never taught or realized the importance of good software design and testing practices. Sure, it always “takes more time” upfront to design data pipelines with code that is functional and unit-testable, and worse, able to be integration tested from end to end. It requires some foresight and thought in both data architecture and pipeline design to enable complete testability.

Integration testing end-to-end in an automated manner is a tough nut to crack. How can you do such a thing on massive pipelines that crunch hundreds of TBs of data? With a little creativity.

February 1, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Now we are getting to the crux of the matter. I would say Data Modeling is probably one of the most unaddressed, yet important parts of Data Warehousing, Data Lakes, and Lake Houses. It raises the most questions and concerns and is responsible for the rise and fall of many Data Engineers.

This is what really drives the difference between the”big three”, Data Modeling.

January 17, 2022

DataFusion courtesy of Rust, vs Spark. Performance and other thoughts.

5 things I wish I knew about Databricks … before I started.

Exploring Delta Lake’s ZORDER, and Performance. On Databricks.

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

Reducing Complexity with Databricks + Delta Lake COPY INTO

Data Quality – Great Expectations for Data Engineers

Part 4 – Keys To Success – Idempotency and Partitioning.

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL.

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Interesting links

Pages

Categories

Archive