Home - Confessions of a Data Guy

Migrate (hundreds) Delta Lake Partitioned Tables to Liquid Clustering

Recently, I had to migrate a few hundred partitioned Delta Lake tables over to Liquid Clustering. It seems straightforward on the surface, but everything usually does, until it isn’t. There was no rocket science involved here, but I did want to write this up to help the myriad of others who will probably […]

December 3, 2025

Data, Data Engineering, Python

PyArrow for Large Dataset Processing

Over the last few years, I’ve found myself using PyArrow more and more for everyday data engineering things. Data ingestion, reading, and writing from various data sources and sinks. Most of us are familiar with Arrow and how it underpins a lot of new tech like DataFusion, and Arrow is used as an internal […]

November 14, 2025

Data, Data Engineering, Python

Lazy Execution with Polars and DuckDB

Taking something for granted for a long time, and then suddenly finding others discovering it for the first time, leaves me a little baffled. It makes me wonder how many folks are living under a proverbial rock. Recently, I saw a post on that infamous LinkedIn from someone excited about Polars’ lazy execution.

November 13, 2025

Uncategorized

Cluster Fatigue. Polars and PyArrow to Postgres and Apache Iceberg (streaming mode)

Lately, I’ve been working on moving expensive distributed compute jobs (that don’t need to be distributed) from Spark to other single-node tools and frameworks. To be honest, there are reasons that Data Platforms might pick Spark, for example, and just keep everything on Spark, even if it doesn’t need Spark. Yes, it costs more. […]

November 4, 2025

Uncategorized

Introduction to Databricks Asset Bundles.

Need a gentle introduction to Databricks Asset Bundles, what they are, how to use them, why to use them? Look no further, you hobbit.

October 21, 2025

Uncategorized

Fivetran buys DBT. People get mad.

Well, we all knew that open source wasn’t a real thing anymore. This just confirms it. I don’t use DBT much, I think it’s for wimps and script kiddies. Anywho, I love watching LinkedIn and Reddit explode with anger at Fivetran buying DBT. Everyone thinks dbt core is done. Who cares. Babies.

October 21, 2025

AI, Data Engineering

Running Llama 3.1 8B Locally (LangChain and SQLite)

Things have changed a lot in the last year related to LLMs and AI; on the one hand, it seems the AI skeptics for coding are increasingly confined to the corners of the internet. Everyone is dancing around in the middle, not sure of where everything should fall. Clearly, if we don’t use AI at […]

October 13, 2025

Uncategorized

DuckDB on Mother Duck. Lake House Glory?

October 9, 2025

Uncategorized

The Semantic Layer – Real or Not?

October 9, 2025

Uncategorized

The Era of the YAML Engineer

You know, I did fight it for a long time, and I’m still fighting it. Look, no one wants to become a Terraform engineer; that is pain and suffering. But we all understand the benefits of IaC (infrastructure as code), and SHOULD be using it in our daily tech lives, or pushing towards it. But […]

October 3, 2025
