I’ve been a Polars bro for most of the last few years. Why? It’s Rust-based, fast, DataFrame-centric, just the way I like it. It also had the excellent feature, right from the start, of Lazy Execution. A few years ago, maybe two, I actually put Polars into production, running on Airflow, working with S3 and reading Delta Lake tables.

I was in love.

Read more

Recently, I had to migrate a few hundred Delta Lake tables that were partitioned over to Liquid Clustering. It seems straightforward on the surface, but everything usually does, until it isn’t. There was no rocket science involved here, but I did want to write this up to help the myriad of others who will probably want/need to move from partitioning to liquid clustering over the next few years.

Databricks recommends liquid clustering for all new Delta Lake tables. Based on past testing, liquid clustering indeed offers significant performance gains.

Again, in terms of DDL, the difference between a partitioned table and a liquid clustered table is minimal.
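For illustration, here is a minimal sketch of the two DDL styles side by side (table and column names are hypothetical; the only real change is swapping PARTITIONED BY for CLUSTER BY):

```sql
-- Hypothetical example: the older partitioned style
CREATE TABLE sales_partitioned (
  order_id   BIGINT,
  order_date DATE,
  amount     DOUBLE
)
USING DELTA
PARTITIONED BY (order_date);

-- The liquid clustering equivalent
CREATE TABLE sales_clustered (
  order_id   BIGINT,
  order_date DATE,
  amount     DOUBLE
)
USING DELTA
CLUSTER BY (order_date);
```

Note that unlike partition columns, clustering keys can be changed later with ALTER TABLE, which is part of the appeal.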

Read more

Well, all the bottom feeders (Iceberg and DuckDB users) are howling at the moon and dancing around a bonfire at midnight trying to cast their evil spells on the rest of us. Apache Iceberg writes with DuckDB? Better late than never I suppose.

Your witchy ways won’t work on me.

Not going to lie, Iceberg writes with MotherDuck is an interesting concept. MotherDuck is lit and Iceberg only puts a little ice on the fire.

Many other tools like Polars and Daft have been offering Iceberg writes for ages now; it’s about time DuckDB preened its feathers and added write support. Up until now, the DuckDB Iceberg Extension has been all about the read. But that is pretty much only good for HelloWorld() crap pumped and dumped on Redditors.

We need write support in the real production world. Oh, and not on some Iceberg table stored on your laptop you ninny.

Read more

So, you’re just a regular old Data Engineer crawling along through the data muck, barely keeping your head above the bits and bytes threatening to drown you. At one point in time you were full of spit and vinegar and enjoyed understanding and playing with every nuance known to man.

But now you are old and wizened, exhausted by the never-ending stream of JIRA tickets you can never get ahead of. You write lots of Spark jobs and consider yourself a PySpark pipeline-writing expert … but when it comes to Spark performance tuning and optimizations? That’s for the birds.

Well my friend, don’t let all the Scala experts look down on you or scare you into thinking Spark performance is simply too complex for the common developer. Liars, every last mother one of them.

Read more

Ok, not going to lie, I rarely find anything of value in the dregs of r/dataengineering, mostly, I fear, because it’s 90% freshers with little to no experience. These green-behind-the-ears know-it-all engineers have never written a line of Perl, never SSH’d into a server, and have no idea what a LAMP stack is. Weak. Sad.

We used to program our way to glory, up hill both ways in the snow. All you do is script kiddy some Python code through Cursor.

A recent post on Data Modeling, specifically that data modeling is dead, caught my eye. A rare piece of gold mixed into the usual pile of crap. There’s some truth being spoken on the interwebs; hold onto your panties, you bright-eyed data zealots. I agree 100% with this sentiment.

DATA MODELING IS DEAD.

Read more

Did you know that Polars, that Rust-based DataFrame tool that is one of the fastest on the market today, just got faster? GPU execution is now available in Polars, making it 70% faster than before!

Read more

You know, after literally multiple decades in the data space, writing code and SQL, at some point along that arduous journey, one might think this problem would be solved by me, or the tooling … yet alas, not to be.

Regardless of the industry or tools used, such as Pandas, Spark, or Postgres, duplicates are a common issue in pipelines, and deduplication in SQL remains the most classic and iconic version of the problem. Things just never change, and humans never learn their lessons, at least I don’t.
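As a toy illustration of the classic dedup pattern (keep the latest record per key), here is a pure-Python sketch; the record fields `id` and `updated_at` are hypothetical, and the logic mimics the familiar SQL `ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1` trick:

```python
def dedupe_latest(records, key="id", order="updated_at"):
    """Keep only the most recent record per key."""
    latest = {}
    for rec in records:
        k = rec[key]
        # Replace the stored record only if this one is newer
        if k not in latest or rec[order] > latest[k][order]:
            latest[k] = rec
    return list(latest.values())

rows = [
    {"id": 1, "updated_at": "2024-01-01", "amount": 10},
    {"id": 1, "updated_at": "2024-02-01", "amount": 15},
    {"id": 2, "updated_at": "2024-01-15", "amount": 7},
]
print(dedupe_latest(rows))
```

The same single pass works at scale as a window function in Spark or a `DISTINCT ON` in Postgres; the point is that the pattern never changes, only the engine.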

Read more

Deletion Vectors are a soft‑delete mechanism in Delta Lake that enables Merge‑on‑Read (MoR) behavior, letting update/delete/merge operations mark row positions as removed without rewriting the underlying Parquet files. This contrasts with the older Copy‑on‑Write (CoW) model, where even a single deleted record triggers rewriting of entire files.

Deletion vectors have been supported since Delta Lake 2.3 (read-only); full support for write operations arrived in later versions: DELETE in 2.4, UPDATE and MERGE in Delta 3.x.


✅ Why Use Deletion Vectors?

  • Faster small changes: Only binary bitmap metadata is written, rather than rewriting large Parquet files.

  • Write efficiency: Particularly efficient when changes affect sparse rows across many files.

  • ACID semantics preserved: Readers still get the correct view by merging with DLV metadata at read time.

However:

  • Read-time overhead: Filtering DLV metadata adds overhead during queries.

  • Maintenance needed: Unapplied deletion markers build up until compaction or purge.
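To make the Merge-on-Read trade-off concrete, here is a toy pure-Python model (this is an illustration only, not the actual Delta implementation, which stores roaring bitmaps in sidecar files): the data file stays immutable, deletes just record row positions, and readers pay the filtering cost until a compaction rewrites the file.

```python
class ToyDeletionVector:
    """Toy model of merge-on-read deletes: the 'file' is never rewritten;
    deletes only accumulate row positions in a bitmap-like set."""

    def __init__(self, rows):
        self.rows = list(rows)   # immutable Parquet-file stand-in
        self.deleted = set()     # deletion vector: marked row positions

    def delete_where(self, predicate):
        # Cheap write path: record positions, don't rewrite data
        for pos, row in enumerate(self.rows):
            if predicate(row):
                self.deleted.add(pos)

    def read(self):
        # Read path pays the merge cost: filter out deleted positions
        return [r for pos, r in enumerate(self.rows) if pos not in self.deleted]

    def compact(self):
        # Maintenance step: rewrite once, then drop the vector
        self.rows = self.read()
        self.deleted.clear()

f = ToyDeletionVector([{"id": i} for i in range(5)])
f.delete_where(lambda r: r["id"] % 2 == 0)
print(f.read())  # [{'id': 1}, {'id': 3}]
```

Every line of `read()` running the position filter is the "read-time overhead" bullet above, and `compact()` is the maintenance you owe the table eventually.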

Read more

So … Astronomer.io … who are they and what do they do?

It’s funny how, every once in a while, the Data Engineering world gets dragged into the light of the real world … usually for bad things … and then gets shoved under the carpet again. Recently, because of the transgressions of the CEO of Astronomer, a little side fling at a Coldplay concert that went viral, Astronomer has popped into the spotlight.

I’ve been around Astronomer for a long time, so I will give you the lowdown on who they are and how they became a billion, yes Billion, dollar company that you’ve never heard of.

Read more

Running dbt on Databricks has never been easier. The integration between dbt Core and Databricks could not be simpler to set up and run. Wondering how to approach running dbt models on Databricks with SparkSQL? Watch the tutorial below.
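For reference, the setup boils down to a profiles.yml entry using the dbt-databricks adapter; this is a minimal sketch where the project name, catalog, schema, host, and warehouse path are all placeholders you would swap for your own workspace values:

```yaml
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main                 # Unity Catalog name (placeholder)
      schema: analytics             # target schema (placeholder)
      host: dbc-xxxx.cloud.databricks.com          # workspace host (placeholder)
      http_path: /sql/1.0/warehouses/abc123        # SQL warehouse path (placeholder)
      token: "{{ env_var('DATABRICKS_TOKEN') }}"   # keep credentials out of the file
```

With that in place, `dbt run` compiles your models to SQL and executes them against the warehouse; no extra glue required.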