Home - Confessions of a Data Guy

Uncategorized

PySpark vs DuckDB vs Polars: The Results

December 16, 2025

Uncategorized

The Evolution of Databricks Compute – Serverless is Winning

December 16, 2025

Uncategorized

Parquet Killer? Introduction to the Lance File Format.

December 3, 2025

Uncategorized

SQL Data Modeling with One Big Table (OBT)

December 3, 2025

Big Data, Data Engineering

Migrate (hundreds) Delta Lake Partitioned Tables to Liquid Clustering

Recently, I had to migrate a few hundred partitioned Delta Lake tables over to Liquid Clustering. It seems straightforward on the surface, but everything usually does, until it isn’t. There was no rocket science involved here, but I did want to write this up to help the myriad of others who will probably […]

December 3, 2025

Data, Data Engineering, Python

PyArrow for Large Dataset Processing

Over the last few years, I’ve found myself using PyArrow more and more for everyday data engineering things. Data ingestion, reading, and writing from various data sources and sinks. Most of us are familiar with Arrow and how it underpins a lot of new tech like DataFusion, and Arrow is used as an internal […]

November 14, 2025

Data, Data Engineering, Python

Lazy Execution with Polars and DuckDB

When something I’ve taken for granted for a long time turns out to be brand new to others, it leaves me a little baffled. It makes me wonder how many folks are living under a proverbial rock. Recently, I saw a post on that infamous LinkedIn about someone excited about Polars’ lazy execution.

November 13, 2025

Uncategorized

Cluster Fatigue. Polars and PyArrow to Postgres and Apache Iceberg (streaming mode)

I’ve been working lately on moving expensive distributed compute jobs (that don’t need to be distributed) from Spark to other single-node tools and frameworks. To be honest, there are reasons that Data Platforms might pick Spark, for example, and just keep everything on Spark, even if it doesn’t need Spark. Yes, it costs more. […]

November 4, 2025

Uncategorized

Introduction to Databricks Asset Bundles.

Need a gentle introduction to Databricks Asset Bundles, what they are, how to use them, why to use them? Look no further, you hobbit.

October 21, 2025

Uncategorized

Fivetran buys DBT. People get mad.

Well, we all knew that open source wasn’t a real thing anymore. This just confirms it. I don’t use DBT much, I think it’s for wimps and script kiddies. Anywho, I love watching LinkedIn and Reddit explode with anger at Fivetran buying DBT. Everyone thinks dbt core is done. Who cares. Babies.

October 21, 2025

