Ok, not going to lie, I rarely find anything of value in the dregs of r/dataengineering, mostly, I fear, because it’s 90% freshers with little to no experience. These green-behind-the-ears know-it-all engineers who’ve never written a line of Perl, never SSH’d into a server, and have no idea what a LAMP stack is. Weak. Sad.

We used to program our way to glory, up hill both ways in the snow. All you do is script kiddy some Python code through Cursor.

A recent post on Data Modeling, specifically that data modeling is dead, caught my eye. A rare piece of gold mixed in with the usual pile of crap. It’s some truth being spoken on the interwebs, so hold onto your panties, you bright-eyed data zealot. I agree 100% with this sentiment.

DATA MODELING IS DEAD.

Read more

Did you know that Polars, that Rust-based DataFrame tool that is one of the fastest on the market today, just got faster?? GPU execution is now available in Polars, making it up to 70% faster than before!!

Read more

I don’t know about you, but I grew up and cut my teeth in what feels like a special and Golden age of software engineering that is now relegated to the history books, a true one-time Renaissance of coding that was beautiful, bright, full of laughter and wonder, a time which has passed and will never return.

Or will it?

Read more

SQLMesh is an open-source framework for managing, versioning, and orchestrating SQL-based data transformations.
It’s in the same “data transformation” space as dbt, but with some important design and workflow differences.


What SQLMesh Is

SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and deploy data transformations written in SQL or Python with visibility and control at any size.

So … what you are telling me is that it’s dbt … but with Python? Interesting enough concept, I should say. One would have to surmise that most people using SQLMesh would be using … SQL! Look at how smart I am.
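For the curious, here is a rough sketch of what a SQLMesh model file looks like. The model name, schema, and columns below are invented for illustration; the MODEL block syntax and the @start_date / @end_date macros are SQLMesh’s own conventions for incremental-by-time-range models.

```sql
-- models/orders_daily.sql (hypothetical model name and source table)
MODEL (
  name analytics.orders_daily,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column order_date
  )
);

SELECT
  order_date,
  COUNT(*) AS order_count
FROM raw.orders
WHERE order_date BETWEEN @start_date AND @end_date
GROUP BY order_date;
```

SQLMesh fills in the @start_date / @end_date macros at run time, so each incremental run only processes the time slice it’s responsible for, which is a big part of the pitch versus full-refresh dbt models.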

Read more

You know, after literally multiple decades in the data space, writing code and SQL, at some point along that arduous journey, one might think this problem would have been solved by me, or by the tooling … yet alas, it was not to be.

Regardless of the industry or the tools used, whether Pandas, Spark, or Postgres, duplicates are a common issue in pipelines, and deduplication in SQL remains the most classic and iconic version of the problem. Things just never change, and humans never learn their lessons, at least I don’t.
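Since the classic fix is the same everywhere, here is a minimal sketch of the ROW_NUMBER() dedup pattern, run through SQLite so it’s self-contained. The orders table and its columns are made up; the pattern (partition by the business key, order so the row you want to keep ranks first, keep rank 1) is the part that carries over to Spark, Postgres, or anywhere else.

```python
import sqlite3

# Toy table with duplicate rows for the same order_id -- a stand-in
# for the real pipeline tables where dupes sneak in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, updated_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10.0),
        (1, "2024-01-02", 12.0),  # later version of order 1
        (2, "2024-01-01", 99.0),
    ],
)

# The classic dedup pattern: ROW_NUMBER() partitioned by the business key,
# ordered so the row we want to keep (latest updated_at) is ranked first.
rows = conn.execute(
    """
    SELECT order_id, updated_at, amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY order_id
    """
).fetchall()

print(rows)  # one row per order_id, keeping the latest version
```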

Read more

Deletion Vectors are a soft‑delete mechanism in Delta Lake that enables Merge‑on‑Read (MoR) behavior, letting update/delete/merge operations mark row positions as removed without rewriting the underlying Parquet files. This contrasts with the older Copy‑on‑Write (CoW) model, where even a single deleted record triggers rewriting of entire files.

Supported since Delta Lake 2.3 (read-only), full deletion vector support for DELETE/UPDATE/MERGE appeared in later versions: DELETE in 2.4, UPDATE/MERGE in Delta 3.x.


✅ Why Use Deletion Vectors?

  • Faster small changes: Only binary bitmap metadata is written, rather than rewriting large Parquet files.

  • Write efficiency: Particularly efficient when changes affect sparse rows across many files.

  • ACID semantics preserved: Readers still get the correct view by merging with DLV metadata at read time.

However:

  • Read-time overhead: Filtering DLV metadata adds overhead during queries.

  • Maintenance needed: Unapplied deletion markers build up until compaction or purge.
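To make the lifecycle above concrete, here is a rough SQL sketch of turning deletion vectors on and later purging them. The table name is a placeholder; the delta.enableDeletionVectors property and REORG TABLE … APPLY (PURGE) are Delta Lake’s own knobs, though exact availability depends on your Delta/runtime version.

```sql
-- Enable deletion vectors on an existing Delta table (table name is hypothetical).
ALTER TABLE my_catalog.sales.events
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

-- DELETE now writes a small deletion-vector file marking removed row
-- positions, instead of rewriting the affected Parquet files.
DELETE FROM my_catalog.sales.events WHERE event_id = 42;

-- Periodic maintenance: physically rewrite files to apply accumulated
-- deletion vectors, so the read-time merge overhead doesn't pile up.
REORG TABLE my_catalog.sales.events APPLY (PURGE);
```

The REORG step is the trade-off in action: you pay the rewrite cost eventually, but on your schedule rather than on every small DELETE.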

Read more

So … Astronomer.io … who are they and what do they do?

It’s funny how, every once in a while, the Data Engineering world gets dragged into the light of the real world … usually for bad things … and then gets shoved under the carpet again. Recently, because of the transgressions of the CEO of Astronomer, a little side fling at a Coldplay concert that went viral, Astronomer has popped into the spotlight.

I’ve been around Astronomer for a long time, so I will give you the lowdown on who they are, and how they became a billion, yes Billion, dollar company that you’ve never heard of.

Read more

Every so often, I have to convert some .txt or .csv file over to Excel format … just because that’s how the business wants to consume or share the data. It is what it is. This means I am often on the lookout for some easy-to-use, simple one-liners that I can use to do just that.
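In that spirit, a minimal sketch of the pandas version of that one-liner. The file names are placeholders, and to_excel needs an xlsx engine like openpyxl installed; the sample CSV is written first just so the snippet runs on its own.

```python
import pandas as pd

# Write a tiny sample CSV so the snippet is self-contained; in real life
# you'd already have the .txt or .csv the business handed you.
with open("data.csv", "w") as f:
    f.write("id,name\n1,alpha\n2,beta\n")

# The actual one-liner: CSV in, Excel out. index=False keeps pandas from
# writing its row index as a spurious first column.
pd.read_csv("data.csv").to_excel("data.xlsx", index=False)
```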

Read more

There are things in life that are satisfying—like a clean DAG run, a freshly brewed cup of coffee, or finally deleting 400 lines of YAML. Then there are things that make you question your life choices. Enter: setting up Apache Polaris (incubating) as an Apache Iceberg REST catalog.

Let’s get one thing out of the way—I didn’t want to do this.

Read more

I make it my duty in life to never have to open an Excel file (xlsx); I feel like if I do, then I made a critical error in my career trajectory. But, I recently had no choice but to open an Excel file on a Mac (or try to) to look at some sample data from a client.

Read more