I was recently working on a PySpark pipeline that used the JDBC option to write about 22 million records from a Spark DataFrame into a Postgres RDS database. Hey, why not use the built-in method provided by Spark? How bad could it be? After all, the creators and maintainers of Spark are probably our version of rocket engineers.

Well, a few hours later, still staring at my screen, I knew something had to change. It was slower than your grandma on her way to the quilt shop.
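For context, the slow path here is Spark's default JDBC writer, which by default issues small 1,000-row batches over however many partitions you happen to have. A minimal sketch of the knobs involved — the host, database, table name, and values below are placeholders, not the actual pipeline from this post:

```python
# Hypothetical connection string -- a placeholder, not from the original pipeline.
# reWriteBatchedInserts is a Postgres JDBC driver flag that collapses batched
# INSERTs into multi-row statements.
jdbc_url = "jdbc:postgresql://my-rds-host:5432/mydb?reWriteBatchedInserts=true"

# Spark JDBC options that typically matter for bulk writes:
jdbc_options = {
    "driver": "org.postgresql.Driver",
    "dbtable": "public.target_table",  # placeholder table
    "batchsize": "10000",              # rows per JDBC batch (default is 1000)
    "numPartitions": "8",              # parallel connections writing at once
}

# With a real SparkSession and DataFrame `df`, the write would look like:
# (df.write.format("jdbc")
#     .option("url", jdbc_url)
#     .options(**jdbc_options)
#     .mode("append")
#     .save())
```

Even tuned, a row-by-row JDBC path has a ceiling; for tens of millions of rows, a bulk-load route (e.g. staging files and Postgres `COPY`) is usually the bigger win.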

Read more

Did you know that Polars, that Rust-based DataFrame tool that is one of the fastest tools on the market today, just got faster?? There is now GPU execution available in Polars that makes it up to 70% faster than before!!

Read more

I can no longer hold back the boiling and frothing mess of righteous anger that rumbles up from within me when I hear the words “Medallion Architecture” in the context of Data Modeling, especially when they come from some young Engineer who doesn’t know any better. Poor saps, born into a Databricks world where their fresh, supple minds have been polluted and twisted by the machinations of a marketing department.

Look, I am a daily user of Databricks; I have no axe to grind with them in particular. But the false gospel of the “Medallion Architecture” has wreaked havoc on a generation of Data Engineers.

Read more

I recently encountered a problem loading a few hundred CSV files with mismatched schemas, caused by a handful of “extra” columns. This turned out not to be an easy problem for Polars, in all its Rust glory, to solve.

That made me curious: how does DuckDB handle mismatched schemas of CSV files?

Of course, this can be a tricky problem to solve, and every creator of a new Data Engineering tool probably has a different take on how it should be handled. The perfectionist will probably say … “Puke the whole thing, schemas should match exactly if you’re reading multiple files.” The realist, who has worked in data for many years, might say, “No, you simply need to at least give the option to MERGE the schemas on read.”
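For what “merge the schemas on read” actually means, here is a minimal pure-Python sketch (no DuckDB or Polars involved): union the column names across files in order of appearance, and fill in nulls where a file didn’t have a column. DuckDB exposes essentially this behavior through the `union_by_name` flag on `read_csv`; the file contents below are made-up examples.

```python
import csv
import io

# Two in-memory "files" with mismatched schemas: the second has an extra column.
file_a = "id,name\n1,alice\n2,bob\n"
file_b = "id,name,country\n3,carol,US\n"

def merge_csvs(raw_files):
    """Read CSVs with differing headers, unioning columns by name."""
    rows, columns = [], []
    for raw in raw_files:
        reader = csv.DictReader(io.StringIO(raw))
        for col in reader.fieldnames:
            if col not in columns:   # grow the unioned schema as new columns appear
                columns.append(col)
        rows.extend(reader)
    # Columns a given file didn't have come back as None (null).
    return columns, [{c: row.get(c) for c in columns} for row in rows]

columns, merged = merge_csvs([file_a, file_b])
```

In DuckDB itself, the equivalent would be something like `SELECT * FROM read_csv('*.csv', union_by_name=true)`.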

Read more

So, you are happily using the new Rust GOAT DataFrame tool Polars to munge messy data, maybe, like me, wrangling 40GB of CSV data spread over multiple files. You are pretty much guaranteed to run into this error.

polars.exceptions.ComputeError: schema lengths differ
This error occurred with the following context stack:
    [1] 'csv scan'
    [2] 'select'

Read more

I don’t know about you, but I grew up and cut my teeth in what feels like a special and Golden Age of software engineering, one now relegated to the history books: a true one-time Renaissance of coding that was beautiful, bright, full of laughter and wonder; a time which has passed and will never return.

Or will it?

Read more

SQLMesh is an open-source framework for managing, versioning, and orchestrating SQL-based data transformations.
It’s in the same “data transformation” space as dbt, but with some important design and workflow differences.


What SQLMesh Is

SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and deploy data transformations written in SQL or Python with visibility and control at any size.

So … what you are telling me is that it’s dbt … but with Python? An interesting enough concept, I should say. One would have to surmise that most people using SQLMesh would be using … SQL! Look at how smart I am.
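For a flavor of that SQL side: a SQLMesh model is a plain SQL file led by a MODEL header block that declares the model’s name and materialization kind, followed by the query itself. A minimal sketch — the schema and table names here are invented for illustration:

```sql
MODEL (
  name analytics.daily_orders,
  kind FULL
);

SELECT
  order_date,
  COUNT(*) AS order_count
FROM raw.orders
GROUP BY order_date
```

SQLMesh parses that header, builds the dependency graph from the query, and handles versioning and deployment of the model from there.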

Read more