December 2025 - Confessions of a Data Guy

Big Data, Data, Data Engineering, DuckDB, Python

DuckDB beats Polars for 1TB of data.

I’ve been a Polars bro for most of the last few years. Why? It’s Rust-based, fast, and DataFrame-centric, just the way I like it. It also had, right from the start, the excellent feature of lazy execution. A few years ago, maybe two, I actually put Polars into production, running on Airflow, working with S3 and reading Delta Lake tables.

I was in love.

Scott Haines on the Future of Data Engineering

December 17, 2025

AI, Data Engineering

The Age of Agentic AI | LangChain and LangGraph style.

It’s a fast-paced, ever-changing world we live in, and there’s nothing we can do about it. I grew up in the middle of the prairie just as the internet went mainstream: the age of Doom, Myst, MSN Messenger, Yahoo Pool, and that irreplaceable Goldeneye. And let’s be honest, World of Warcraft on a PC was game-changing. I suppose you could chalk up half my feelings to nostalgia and old-person humdrum; I won’t deny it.

I see the current Agentic AI confusion in the software community as something similar to the old days when I split my time between being a river rat and playing Battlefield 1942 all night long, enraptured by new tech, yet drawn to the old ways.

PySpark vs DuckDB vs Polars: The Results

December 16, 2025

Uncategorized

The Evolution of Databricks Compute – Serverless is Winning

December 16, 2025

Uncategorized

Parquet Killer? Introduction to the Lance File Format.

December 3, 2025

Uncategorized

SQL Data Modeling with One Big Table (OBT)

December 3, 2025

Big Data, Data Engineering

Migrate (hundreds) Delta Lake Partitioned Tables to Liquid Clustering

Recently, I had to migrate a few hundred Delta Lake tables that were partitioned over to Liquid Clustering. It seems straightforward on the surface, but everything usually does, until it isn’t. There was no rocket science involved here, but I did want to write this up to help the myriad of others who will probably want/need to move from partitioning to liquid clustering over the next few years.

Databricks recommends liquid clustering for all new Delta Lake tables. Based on past testing, liquid clustering indeed offers significant performance gains.

Again, the difference between a partitioned table and a liquid clustered table, in terms of DDL, is not very much, as you can see.
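To make the DDL comparison concrete, here is a minimal sketch in Python that builds both statements side by side. The table name `events`, its columns, and the helper `create_table_ddl` are hypothetical and only for illustration; the one real difference shown is Delta Lake's `PARTITIONED BY` clause versus its `CLUSTER BY` clause.

```python
from typing import Dict, Sequence


def create_table_ddl(
    table: str, columns: Dict[str, str], keys: Sequence[str], liquid: bool
) -> str:
    """Build a Delta Lake CREATE TABLE statement (hypothetical helper).

    liquid=False emits PARTITIONED BY (...); liquid=True emits CLUSTER BY (...).
    """
    col_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    clause = "CLUSTER BY" if liquid else "PARTITIONED BY"
    return (
        f"CREATE TABLE {table} ({col_defs}) "
        f"USING DELTA {clause} ({', '.join(keys)})"
    )


# Hypothetical example table; the names and types are assumptions.
columns = {"event_id": "BIGINT", "event_date": "DATE", "payload": "STRING"}

print(create_table_ddl("events", columns, ["event_date"], liquid=False))
print(create_table_ddl("events", columns, ["event_date"], liquid=True))
```

The two statements differ only in the layout clause, which is why the migration looks trivial on paper. The work is in everything around the DDL, such as rewriting existing data and doing it across hundreds of tables.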

