6 Tips for Optimizing Databricks Cluster and Pipeline Performance and Cost

Databricks, easily the hottest tool these days for Data Lakes and Data Warehousing, is a beast. As with any new technology, there are always growing pains, learnings, and tips and tricks that might not be obvious to those dipping their toes in the water. Not understanding certain concepts, and being unaware of specific configurations, can very easily cost you time and money when running large ETL pipelines on Databricks.

I want to share 6 tips for Databricks newbies, and oldies, that are foundational to good Data Engineering architecture, affecting both performance and cost.


Data Modeling – Relational Databases (SQL) vs Data Lake (File Based)

Data Modeling is a topic that never goes away. Sometimes I do reminisce about the good ol’ days of Kimball-style data models. It was so simple and straightforward, just the same thing for years. Then Big Data happened, Spark happened. Things just changed. There is a lot of new content coming out around Data Lakes and data modeling, but it still seems like a fluid topic, with nothing as concrete as the classic Data Warehouse Toolkit.

Oh, what to do, what to do. I do believe there are a few key ideas and points to being successful with file-based Data Lake modeling. I think it’s a mistake to fully embrace the classic Kimball-style Data Warehouse approach. It really comes down to this: relational database (SQL) and file-based data models are going to be different, for technical and practical reasons.


Review of Airbyte for Data Engineers

It’s hard to keep up with the never-ending stream of new Data Engineering tools these days. There is always something new around the next bend. I find it interesting to kick the tires on the new kids on the block, and to see what angle or pain point a new tool tries to home in on. I mean, if you think about Data Engineering in general, the fundamentals really haven’t changed that much over the years. The tools change, but what we do hasn’t. We are expected to move data from point A to point B in a reliable, scalable, and efficient manner.

Today I’m going to be reviewing a tool called Airbyte. When I review a new product I’m usually incredibly basic about what I look for, and I try to answer some easy and obvious questions. How easy is it to set up and use? What does the documentation look like? When I run into a problem, can I solve it? Is the overhead of adding this new tool to a tech stack worth the features it offers? This is how we will explore Airbyte.


Bitwise Operations for Data Engineers

Ugh. Cursed bitwise operations … something usually reserved for the low-level mythical engineers writing code no one should have to write. I’ve escaped them all but twice during my meager existence. Recently I had to use a bitwise operation while converting a Python hashing algorithm into PySpark code. It made my brain hurt. What is this wizardry all about anyway? It got me thinking: I should really attempt to learn something about bitwise operations, since they come up once every 10 years.
