Recently, I had to migrate a few hundred partitioned Delta Lake tables over to Liquid Clustering. It seems straightforward on the surface, but everything usually does, until it isn’t. There was no rocket science involved here, but I did want to write this up to help the myriad of others who will probably […]

Over the last few years, I’ve found myself using PyArrow more and more for everyday data engineering things. Data ingestion, reading, and writing from various data sources and sinks. Most of us are familiar with Arrow and how it underpins a lot of new tech like DataFusion, and Arrow is used as an internal […]

When something I’ve taken for granted for a long time turns out to be brand new to others, it leaves me a little baffled. It makes me wonder how many folks are living under a proverbial rock. Recently, I saw a post on that infamous LinkedIn from someone excited about Polars’ lazy execution.

Lately, I’ve been working on moving expensive distributed compute jobs (that don’t need to be distributed) from Spark to other single-node tools and frameworks. To be honest, there are reasons that Data Platforms might pick Spark, for example, and just keep everything on Spark, even if it doesn’t need Spark. Yes, it costs more. […]

Need a gentle introduction to Databricks Asset Bundles: what they are, how to use them, and why to use them? Look no further, you hobbit.

Well, we all knew that open source wasn’t a real thing anymore. This just confirms it. I don’t use DBT much, I think it’s for wimps and script kiddies. Anywho, I love watching LinkedIn and Reddit explode with anger at Fivetran buying DBT. Everyone thinks dbt core is done. Who cares. Babies.