I was recently working on a PySpark pipeline that used the JDBC option to write about 22 million records from a Spark DataFrame into a Postgres RDS database. Hey, why not use the built-in method provided by Spark? How bad could it be? After all, the creators and maintainers of Spark are probably our version of rocket engineers.

Well, a few hours later, still staring at my screen, I knew something had to change. It was slower than your grandma on her way to the quilt shop.
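For context, the slow path here is Spark's default JDBC writer, which by default issues small 1,000-row batches over however many partitions you happen to have. A minimal sketch of the knobs involved — the host, database, table name, and values below are placeholders, not the actual pipeline from this post:

```python
# Hypothetical connection string -- a placeholder, not from the original pipeline.
# reWriteBatchedInserts is a Postgres JDBC driver flag that collapses batched
# INSERTs into multi-row statements.
jdbc_url = "jdbc:postgresql://my-rds-host:5432/mydb?reWriteBatchedInserts=true"

# Spark JDBC options that typically matter for bulk writes:
jdbc_options = {
    "driver": "org.postgresql.Driver",
    "dbtable": "public.target_table",  # placeholder table
    "batchsize": "10000",              # rows per JDBC batch (default is 1000)
    "numPartitions": "8",              # parallel connections writing at once
}

# With a real SparkSession and DataFrame `df`, the write would look like:
# (df.write.format("jdbc")
#     .option("url", jdbc_url)
#     .options(**jdbc_options)
#     .mode("append")
#     .save())
```

Even tuned, a row-by-row JDBC path has a ceiling; for tens of millions of rows, a bulk-load route (e.g. staging files and Postgres `COPY`) is usually the bigger win.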

Read more

Did you know that Polars, that Rust-based DataFrame tool that is one of the fastest tools on the market today, just got faster?? There is now GPU execution available in Polars that makes it up to 70% faster than before!!

Read more

I can no longer hold back the boiling and frothing mess of righteous anger that rumbles up from within me when I hear the words “Medallion Architecture” in the context of Data Modeling, especially when they come from some young Engineer who doesn’t know any better. Poor saps, born into a Databricks world where their fresh, supple minds have been polluted and twisted by the machinations of a marketing department.

Look, I am a daily user of Databricks; I have no axe to grind with them in particular. But the false gospel of the “Medallion Architecture” has wreaked havoc on a generation of Data Engineers.

Read more

I recently encountered a problem loading a few hundred CSV files with mismatched schemas, caused by a handful of “extra” columns. This turned out not to be an easy problem for Polars, in all its Rust glory, to solve.

That made me curious: how does DuckDB handle mismatched schemas of CSV files?

Of course, this can be a tricky problem to solve, and every creator of a new Data Engineering tool probably has a different take on how it should be handled. The perfectionist will probably say … “Puke the whole thing, schemas should match exactly if you’re reading multiple files.” The realist, who has worked in data for many years, might say, “No, you simply need to at least give the option to MERGE the schemas on read.”
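For what “merge the schemas on read” actually means, here is a minimal pure-Python sketch (no DuckDB or Polars involved): union the column names across files in order of appearance, and fill in nulls where a file didn’t have a column. DuckDB exposes essentially this behavior through the `union_by_name` flag on `read_csv`; the file contents below are made-up examples.

```python
import csv
import io

# Two in-memory "files" with mismatched schemas: the second has an extra column.
file_a = "id,name\n1,alice\n2,bob\n"
file_b = "id,name,country\n3,carol,US\n"

def merge_csvs(raw_files):
    """Read CSVs with differing headers, unioning columns by name."""
    rows, columns = [], []
    for raw in raw_files:
        reader = csv.DictReader(io.StringIO(raw))
        for col in reader.fieldnames:
            if col not in columns:   # grow the unioned schema as new columns appear
                columns.append(col)
        rows.extend(reader)
    # Columns a given file didn't have come back as None (null).
    return columns, [{c: row.get(c) for c in columns} for row in rows]

columns, merged = merge_csvs([file_a, file_b])
```

In DuckDB itself, the equivalent would be something like `SELECT * FROM read_csv('*.csv', union_by_name=true)`.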

Read more

So, you are happily using the new Rust GOAT DataFrame tool Polars to munge messy data, maybe, like me, wrangling 40GB of CSV data spread over multiple files. You are pretty much guaranteed to run into this error.

polars.exceptions.ComputeError: schema lengths differ
This error occurred with the following context stack:
    [1] 'csv scan'
    [2] 'select'

Read more

I don’t know about you, but I grew up and cut my teeth in what feels like a special and Golden Age of software engineering, one now relegated to the history books: a true one-time Renaissance of coding that was beautiful, bright, full of laughter and wonder; a time which has passed and will never return.

Or will it?

Read more

SQLMesh is an open-source framework for managing, versioning, and orchestrating SQL-based data transformations.
It’s in the same “data transformation” space as dbt, but with some important design and workflow differences.


What SQLMesh Is

SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and deploy data transformations written in SQL or Python with visibility and control at any size.

So … what you are telling me is that it’s dbt … but with Python? An interesting enough concept, I should say. One would have to surmise that most people using SQLMesh would be using … SQL! Look at how smart I am.
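For a flavor of that SQL side: a SQLMesh model is a plain SQL file led by a MODEL header block that declares the model’s name and materialization kind, followed by the query itself. A minimal sketch — the schema and table names here are invented for illustration:

```sql
MODEL (
  name analytics.daily_orders,
  kind FULL
);

SELECT
  order_date,
  COUNT(*) AS order_count
FROM raw.orders
GROUP BY order_date
```

SQLMesh parses that header, builds the dependency graph from the query, and handles versioning and deployment of the model from there.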

Read more