DuckDB has MAJOR Problems! OOM Errors.

I recently ran a challenge, and the results were clear: DuckDB CANNOT handle larger-than-memory datasets. OOM errors. See the link below for more details.

DuckDB vs Polars – Thunderdome. 16GB on 4GB machine Challenge. 

 

AWS Lambdas. Useful for Data Engineering?

Are Lambdas one of those tools that everyone uses and no one talks about? I guess I’ve taken them for granted over the years, even though they are incredibly useful. For a lot of my Data Engineering career I didn’t really think about or use AWS Lambdas; I just saw them as annoying little flies on the wall, incapable of “real” use in Data Engineering pipelines.

But I’ve changed my evil ways and come to love the little buggers. They are easy to use and cheap, with no infrastructure to worry about. I mean, they are little diamonds in the rough.

Now that I look back and reflect, I’m surprised that I don’t see more content singing the praises and glories of AWS Lambdas. Why is the world silent? Those workhorses chug along in the background, millions and millions of times every second.

Today will not be earth-shattering, just a 10,000-foot view of AWS Lambdas, with a focus on their use in Data Engineering, to inspire you to use them.
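To give a taste of how little code a Lambda needs, here is a minimal sketch of a Python handler. It assumes the function is triggered by an S3 "object created" notification; the name `handler` and the shape of the response are my own choices for illustration, and nothing here talks to AWS, so you can run it locally.

```python
import json
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the bucket, key, and size of each file in an S3 event.

    Assumption: `event` follows the S3 notification layout, where each
    record carries event["Records"][i]["s3"]["bucket"] and ["object"].
    """
    files = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        files.append(
            {
                "bucket": s3["bucket"]["name"],
                # Object keys arrive URL-encoded, so decode them first.
                "key": unquote_plus(s3["object"]["key"]),
                "size_bytes": s3["object"].get("size", 0),
            }
        )
    # A real pipeline would read or transform the files here.
    return {"statusCode": 200, "body": json.dumps({"files": files})}
```

Calling `handler(fake_event, None)` with a hand-built dictionary is an easy way to test the logic before deploying anything.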

Read more

5 git Commands your Grandma uses.

GitHub’s Copilot Writes Data Pipelines

Why I both Love and Hate LeetCode

There are a few things in life I both love and hate. Let’s see… hot weather, cold weather, working for a living, and… LeetCode. I mean, it is totally fun to push yourself and try to solve hard problems, but then the other side of me is like… well, I’ve been writing code for years and 80% of this stuff is nothing like writing code in real life. I think the LeetCode platform itself is an amazing tool, and has provided both people and companies with an elegant way to showcase and practice skills. But is there too much of a good thing? Of course.

Read more

Intro to Apache Beam for Data Engineers

Apache Beam for Data Engineers.

What is this thing? What’s it good for? Who’s using it and why? That’s pretty much what I ask myself once a month when I actually see the name Apache Beam pop up in some feed I’m scrolling through. I figured it has to be legit to be Apache incubated, but I’ve never run across anyone in the wild using it yet. On the surface it appears to be semi-pointless since it runs on top of other distributed systems like Spark, but I’m sure there is more to it. Today, I’m going to run through an overview of Apache Beam and then try installing it and running some data through it, to kick the tires as it were, and see if my mind changes about the pointless bit.

Read more

The Battlefield of the Data Engineer.

I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean, the URL does say “confessions”, does it not? For better or worse I’ve been thinking a lot lately about what it means to be a Data Engineer, what it’s like to be a Data Engineer, and what makes a good Data Engineer. Just the life of a Data Engineer in general. The Battlefield of the Data Engineer is fought in the labyrinth of nested SQL queries. It rages to the depths of distributed computing clusters. It vies for victory on the crags and peaks of DevOps. It attacks for precious ground amid the chaos of the perfect OOP and Functional code bases. Phew… and all that just to keep your head above water.

Read more

Big Data File Showdown – Avro vs Parquet with Python.

Apache Parquet vs Apache Avro

There comes a point in the life of every data person when we have to graduate from CSV files. At a certain point the data becomes big enough, or we hear talk on the street about other file formats. Apache Parquet and Apache Avro are two of those formats that have been coming up more with the rise of distributed data processing engines like Spark.

Read more

Apache Airflow for Data Engineers

On again, off again. I feel like that is the best way to describe Apache Airflow. It started out around 2014 at Airbnb and has been steadily gaining traction and usage ever since, albeit slowly. I still believe that Airflow is very underutilized in the data engineering community as a whole. Most everyone has heard of it, but its usage seems to be sporadic at best. I’m going to talk about what makes Apache Airflow the perfect tool for any Data Engineer, and show you how you can use it to great effect while not committing to it completely.

Read more

You Have to Try This… from io import StringIO, BytesIO

StringIO and BytesIO are perfect for making your Python faster.

Ever heard of something called a File Object in Python? Ever heard of BytesIO or StringIO? You’re missing out. It’s easy, fast, and wonderful. In short, it’s the best. For some reason IO streams are a totally underused feature that rarely comes up in most code. We all know that memory is faster than disk IO, and this is what I use IO streams for.
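If you have never touched these, here is a minimal sketch of the idea using only the standard library. The sample rows and variable names are made up for illustration: it writes a CSV into a StringIO, gzips it into a BytesIO, and reads it back, all without a single file hitting disk.

```python
import csv
import gzip
from io import BytesIO, StringIO

# Made-up sample data, purely for illustration.
rows = [("id", "name"), (1, "duck"), (2, "polar bear")]

# StringIO is a text file object that lives in memory.
text_buffer = StringIO()
csv.writer(text_buffer).writerows(rows)
csv_text = text_buffer.getvalue()

# BytesIO is the same idea for binary data, here a gzip stream.
bytes_buffer = BytesIO()
with gzip.GzipFile(fileobj=bytes_buffer, mode="wb") as gz:
    gz.write(csv_text.encode("utf-8"))

# Rewind the buffer and read it back like any other file.
bytes_buffer.seek(0)
with gzip.GzipFile(fileobj=bytes_buffer, mode="rb") as gz:
    round_trip = gz.read().decode("utf-8")

parsed = list(csv.reader(StringIO(round_trip)))
```

Anything that accepts a file object, such as `csv`, `gzip`, or an upload call in a client library, can usually be handed one of these buffers instead of a path on disk.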

Read more