Data Engineering Archives - Page 4 of 27 - Confessions of a Data Guy

PyArrow for Large Dataset Processing

Over the last few years, I’ve found myself using PyArrow more and more for everyday data engineering things: data ingestion, plus reading and writing from various data sources and sinks. Most of us are familiar with Arrow and how it underpins a lot of new tech like DataFusion, where Arrow is used as an internal memory format.

But, small and mighty though it might be, the pyarrow Python package is a force to be reckoned with, capable of blasting through all sorts of cloud-based datasets. It’s not so much a data transformation framework as a way to represent core datasets and transfer data hither and thither over the wire, from one format to another.

November 14, 2025

Data, Data Engineering, Python

Lazy Execution with Polars and DuckDB

When something I’ve taken for granted for a long time turns out to be something others are discovering for the first time, it leaves me a little baffled. It makes me wonder how many folks are living under a proverbial rock. Recently, I saw a post on that infamous LinkedIn from someone excited about Polars’ lazy execution.
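For anyone who is meeting the idea for the first time, here is a minimal sketch of what "lazy execution" means, in plain Python. The `LazyRows` class and the `scan_orders` source are hypothetical names made up for illustration; this is not how Polars or DuckDB are implemented, just the core idea that building a query only records steps and nothing touches the data until you ask for results.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LazyRows:
    """Hypothetical toy lazy frame: records steps, runs nothing until collect()."""

    source: Callable[[], Iterable[dict]]
    steps: tuple = ()

    def filter(self, predicate):
        # Only remember the predicate; no rows are read here.
        return LazyRows(self.source, self.steps + (("filter", predicate),))

    def select(self, *columns):
        # Only remember the column names; no rows are read here.
        return LazyRows(self.source, self.steps + (("select", columns),))

    def collect(self):
        # The source is scanned for the first time here.
        rows = iter(self.source())
        for kind, arg in self.steps:
            if kind == "filter":
                rows = filter(arg, rows)
            else:
                rows = map(lambda r, cols=arg: {c: r[c] for c in cols}, rows)
        return list(rows)


reads = []  # records every time the source is actually scanned


def scan_orders():
    reads.append("scan")
    return [
        {"id": 1, "amount": 50, "region": "us"},
        {"id": 2, "amount": 500, "region": "eu"},
        {"id": 3, "amount": 900, "region": "us"},
    ]


plan = LazyRows(scan_orders).filter(lambda r: r["amount"] > 100).select("id", "region")
reads_before_collect = len(reads)  # the plan exists, but nothing has been scanned
result = plan.collect()            # now the scan, filter, and select all run
```

Real engines go a step further and optimize the recorded plan before running it (pushing filters and column selections down toward the scan, for example), which this toy version does not attempt.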

November 13, 2025

AI, Data Engineering

Running Llama 3.1 8B Locally (LangChain and SQLite)

Things have changed a lot in the last year related to LLMs and AI; on the one hand, it seems the AI skeptics for coding are increasingly confined to the corners of the internet. Everyone is dancing around in the middle, not sure of where everything should fall. Clearly, if we don’t use AI at all, we will become coding dinosaurs. But a sea of junior devs relying too much on Cursor has created a knowledge crisis, and demand for Senior+ devs has skyrocketed.

October 13, 2025

Data, Data Engineering, DuckDB, Python

DuckDB vs Polars. Wait. DuckDB and Polars.

So, the classic newbie question. DuckDB vs Polars, which one should you pick?

This is an interesting question, and actually drives a lot of search traffic to this website on which you find yourself wasting time. I thank you for that.

This is probably the most classic type of question that all developers eventually ask at some point in their sad and depressing lives. Isn’t that the same story that is as old as time? This stick is better than that rock. Rust is better than C. Databricks better than Snowflake. You know, Delta Lake better than Iceberg.

And so the world keeps turning and grinding away.

DuckDB vs Polars? That’s the wrong question.

September 25, 2025

Big Data, Data, Data Engineering, DuckDB

Apache Iceberg Writes with DuckDB (or not)

Well, all the bottom feeders (Iceberg and DuckDB users) are howling at the moon and dancing around a bonfire at midnight trying to cast their evil spells on the rest of us. Apache Iceberg writes with DuckDB? Better late than never I suppose.

Your witchy ways won’t work on me.

Not going to lie, Iceberg writes with MotherDuck is an interesting concept. MotherDuck is lit and Iceberg only puts a little ice on the fire.

Many other tools like Polars or Daft have been offering Iceberg writes for ages now; it’s about time DuckDB preened its feathers and added write support. Up until now the DuckDB Iceberg Extension has been all about the read. But, that is pretty much only good for HelloWorld() crap pumped and dumped on Redditors.

We need write support in the real production world. Oh, and not on some Iceberg table stored on your laptop you ninny.

September 18, 2025

Big Data, Data, Data Engineering

How to tune Spark Shuffle Partitions.

So, you’re just a regular old Data Engineer crawling along through the data muck, barely keeping your head above the bits and bytes threatening to drown you. At one point in time you were full of spit and vinegar and enjoyed understanding and playing with every nuance known to man.

But, now you are old and wizened, exhausted with the never-ending stream of JIRA tickets from which you can never get ahead. You write lots of Spark jobs, consider yourself a PySpark pipeline writing expert … but when it comes to Spark performance tuning and optimizations? That’s for the birds.

Well my friend, don’t let all the Scala experts look down on you and scare you into thinking Spark performance is simply too complex for the common developer. Liars, every last mother one of them.
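To show it really isn’t black magic, here is a rough sketch of one common rule of thumb for picking a shuffle partition count: divide the shuffle size by a target partition size, then round up to a multiple of the available cores. The helper name `suggest_shuffle_partitions` and the 128 MiB target are assumptions for illustration, not something Spark prescribes; the one hard fact here is that `spark.sql.shuffle.partitions` defaults to 200.

```python
DEFAULT_SHUFFLE_PARTITIONS = 200  # Spark's default for spark.sql.shuffle.partitions


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * 1024**2,  # assumed target, tune for your jobs
    total_cores: int = 1,
) -> int:
    """Hypothetical helper: size-based partition count, rounded up to whole task waves."""
    if shuffle_bytes <= 0:
        return max(1, total_cores)
    # Partitions needed so that each one holds roughly the target size.
    by_size = (shuffle_bytes + target_partition_bytes - 1) // target_partition_bytes
    # Round up to a multiple of the core count so the last wave of tasks is full.
    waves = (by_size + total_cores - 1) // total_cores
    return max(total_cores, waves * total_cores)


# 50 GiB of shuffle data on a cluster with 16 executor cores
big_job = suggest_shuffle_partitions(50 * 1024**3, total_cores=16)
# 10 GiB of shuffle data on a cluster with 48 executor cores
small_job = suggest_shuffle_partitions(10 * 1024**3, total_cores=48)
```

The resulting number would then be applied with something like `spark.conf.set("spark.sql.shuffle.partitions", str(big_job))`. Treat it as a starting point and check the shuffle read sizes in the Spark UI before trusting it.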

September 12, 2025

Big Data, Data, Data Engineering, Data Warehousing

Is Data Modeling Dead?

Ok, not going to lie, I rarely find anything of value in the dregs of r/dataengineering, mostly, I fear, because it’s 90% freshers with little to no experience. These green-behind-the-ears know-it-all engineers who’ve never written a line of Perl, SSH’d into a server, and have no idea what a LAMP stack is. Weak. Sad.

We used to program our way to glory, up hill both ways in the snow. All you do is script kiddy some Python code through Cursor.

A recent post on Data Modeling, specifically that data modeling is dead, caught my eye. A rare piece of gold mixed in the usual pile of crap. It’s some truth being spoken on the interwebs, hold onto your panties you bright-eyed data zealot. I agree 100% with this sentiment.

DATA MODELING IS DEAD.

September 8, 2025

AI, Big Data, Data, Data Engineering, Python

Polars on GPU: Blazing Fast DataFrames for Engineers

Did you know that Polars, that Rust-based DataFrame tool that is one of the fastest tools on the market today, just got faster?? GPU execution is now available on Polars, and it makes it 70% faster than before!!

August 28, 2025

AI, Data, Data Engineering, Python

Becoming a Senior+ Engineer in the Age of AI

I don’t know about you, but I grew up and cut my teeth in what feels like a special and Golden age of software engineering that is now relegated to the history books, a true one-time Renaissance of coding that was beautiful, bright, full of laughter and wonder, a time which has passed and will never return.

Or will it?

August 15, 2025

Data, Data Engineering, SQL

What is SQLMesh and how is it different from dbt?

SQLMesh is an open-source framework for managing, versioning, and orchestrating SQL-based data transformations.
It’s in the same “data transformation” space as dbt, but with some important design and workflow differences.

What SQLMesh Is

SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and deploy data transformations written in SQL or Python with visibility and control at any size.

So … what you are telling me is that it’s dbt … but with Python? Interesting enough concept, I should say. One would have to surmise that most people using SQLMesh would be using … SQL! Look at how smart I am.

August 14, 2025
