I don’t know about you, but I grew up and cut my teeth in what feels like a special and Golden age of software engineering that is now relegated to the history books, a true onetime Renaissance of coding that was beautiful, bright, full of laughter and wonder, a time which has passed and will […]

SQLMesh is an open-source framework for managing, versioning, and orchestrating SQL-based data transformations. It’s in the same “data transformation” space as dbt, but with some important design and workflow differences. What SQLMesh Is SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and […]

You know, after literally multiple decades in the data space, writing code and SQL, at some point along that arduous journey, one might think this problem would be solved by me, or the tooling … yet alas, not to be. Regardless of the industry or tools used, such as Pandas, Spark, or Postgres, duplicates are […]

Deletion Vectors are a soft‑delete mechanism in Delta Lake that enables Merge‑on‑Read (MoR) behavior, letting update/delete/merge operations mark row positions as removed without rewriting the underlying Parquet files. This contrasts with the older Copy‑on‑Write (CoW) model, where even a single deleted record triggers rewriting of entire files YouTube+8docs.delta.io+8Medium+8. Supported since Delta Lake 2.3 (read-only), full […]

I recently used Polars … inside an AWS Lambda … to fill a novel and somewhat obtuse CSV formatting issue. We were receiving CSV files that contained rows with specific columns that were empty because the following values matched the first one, until a different value finally appeared. Let me show you.

So … Astronomer.io … who are they and what do they do? It’s funny how, every once in a while, the Data Engineering world gets dragged into the light of the real world … usually for bad things … and then gets shoved under the carpet again. Recently, because of the transgressions of the CEO […]

The future never shows up quietly. Just when you think you’ve tamed the latest “must-have” technology, a fresh acronym crashes the party. I’d barely finished wrapping my head around the Lakehouse paradigm when Databricks rolled out something new at the 2025 Data & AI Summit: Lakebase, a fully managed PostgreSQL engine built directly into the […]

I’d be lying if I said a small part of me didn’t groan when I first read about SQL Scripting being released by Databricks. Don’t get me wrong—I don’t fault Databricks for giving users what they want. After all, if you don’t feed the masses, they’ll turn on you. We data engineers are gluttons for […]

I’ve been thinking about this for a few days now, and I still don’t know whether to cheer or groan. Some moments, I see DuckLake as a smart, much-needed evolution; other times, it feels like just another unnecessary entry in the ever-growing Lake House jungle. Reality, as always, is probably somewhere in between. MotherDuck and […]

Let’s be honest: working with Apache Iceberg stops being fun the moment you step off your local laptop and into anything that resembles production. The catalog system—mandatory and rigid—has long been the Achilles’ heel of an otherwise promising open data format. For a long time, you had two options: over-engineered corporate-grade solutions that require infrastructure […]