You know, after literally multiple decades in the data space, writing code and SQL, you might think that somewhere along that arduous journey this problem would have been solved, by me or by the tooling … yet alas, it was not to be.

Regardless of the industry or the tools, be it Pandas, Spark, or Postgres, duplicates are a common issue in pipelines, and duplicates in SQL remain the most classic and iconic version of the problem. Things just never change, and humans never learn their lessons, at least I don’t.

Read more

Deletion Vectors are a soft‑delete mechanism in Delta Lake that enables Merge‑on‑Read (MoR) behavior, letting update/delete/merge operations mark row positions as removed without rewriting the underlying Parquet files. This contrasts with the older Copy‑on‑Write (CoW) model, where even a single deleted record triggers rewriting of entire files.

Delta Lake 2.3 added read-only support for tables with deletion vectors. Write support arrived in later versions: DELETE in 2.4, and UPDATE/MERGE in Delta 3.x.


✅ Why Use Deletion Vectors?

  • Faster small changes: Only binary bitmap metadata is written, rather than rewriting large Parquet files.

  • Write efficiency: Particularly efficient when changes affect sparse rows across many files.

  • ACID semantics preserved: Readers still get the correct view because the deletion vectors are merged with the data files at read time.

However:

  • Read-time overhead: Applying deletion vectors to filter out removed rows adds overhead during queries.

  • Maintenance needed: Unapplied deletion markers build up until compaction or purge.
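
To make the trade-off concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Delta Lake code: `DataFile`, `delete_copy_on_write`, `delete_merge_on_read`, `read`, and `purge` are hypothetical names invented for illustration, a Python set stands in for the bitmap, and a list stands in for a Parquet file.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file plus its deletion vector."""
    rows: list
    deleted: set = field(default_factory=set)  # row positions marked as removed
    rewrites: int = 0                          # how often the file body was rewritten


def delete_copy_on_write(f, predicate):
    # CoW: produce a brand-new file body without the matching rows.
    kept = [r for r in f.rows if not predicate(r)]
    return DataFile(rows=kept, rewrites=f.rewrites + 1)


def delete_merge_on_read(f, predicate):
    # MoR: leave the rows alone and only record which positions are removed.
    marked = {i for i, r in enumerate(f.rows) if predicate(r)}
    return DataFile(rows=f.rows, deleted=f.deleted | marked, rewrites=f.rewrites)


def read(f):
    # Readers pay the cost here: every row is checked against the markers.
    return [r for i, r in enumerate(f.rows) if i not in f.deleted]


def purge(f):
    # Maintenance: apply the accumulated markers by rewriting the file once.
    return DataFile(rows=read(f), rewrites=f.rewrites + 1)
```

Both delete paths return the same rows from `read`, but only the copy-on-write path bumps `rewrites`. The merge-on-read path defers that cost until `purge` runs, which mirrors the read-time overhead and maintenance points in the list above.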

Read more

I recently used Polars … inside an AWS Lambda … to fix a novel and somewhat obscure CSV formatting issue.

We were receiving CSV files in which certain columns were left empty whenever a row's value repeated the one above it. A value was written only on the first row of each run, and the column stayed blank until a different value finally appeared.

Let me show you.
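
The general shape of that problem is a forward fill. What follows is a minimal sketch in plain Python using only the standard library, not the Polars code from the post (Polars offers a `forward_fill` expression for the same job). The sample data, the column names, and the helpers `forward_fill` and `fill_csv_text` are all made up for illustration.

```python
import csv
import io

# Made-up sample: "region" and "store" are blank when they repeat the row above.
SAMPLE = "region,store,sales\nEast,A,10\n,,12\n,B,7\nWest,C,3\n,,5\n"


def forward_fill(rows, columns):
    """Replace empty strings in `columns` with the last non-empty value seen."""
    last_seen = {}
    filled = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        for col in columns:
            if row[col] == "":
                # Blank cell: reuse the previous value (stays blank if none yet).
                row[col] = last_seen.get(col, "")
            else:
                last_seen[col] = row[col]
        filled.append(row)
    return filled


def fill_csv_text(text, columns):
    """Parse CSV text and forward-fill the given columns."""
    reader = csv.DictReader(io.StringIO(text))
    return forward_fill(reader, columns)


if __name__ == "__main__":
    for row in fill_csv_text(SAMPLE, ["region", "store"]):
        print(row)
```

Each column is filled independently here. If a blank in one column should reset when another column changes, the logic would need a group key, which this sketch does not attempt.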

Read more

So … Astronomer.io … who are they and what do they do?

It’s funny how, every once in a while, the Data Engineering world gets dragged into the light of the real world … usually for bad things … and then gets shoved under the carpet again. Recently, because of the transgressions of the CEO of Astronomer, a little side fling at a Coldplay concert that went viral, Astronomer has popped into the spotlight.

I’ve been around Astronomer for a long time, so I will give you the lowdown on who they are, and how they became a billion, yes Billion, dollar company that you’ve never heard of.

Read more