Over the last few years, I’ve found myself using PyArrow more and more for everyday data engineering tasks: data ingestion, and reading and writing from various data sources and sinks. Most of us are familiar with Arrow and how it underpins a lot of new tech like DataFusion, which uses Arrow as its internal memory format.

But, small and mighty though it might be, the pyarrow Python package is a force to be reckoned with, capable of blasting through all sorts of cloud-based datasets. It’s not so much a data transformation framework as a way to represent core datasets, transferring data hither and thither over the wire from one format to another.


When something I’ve taken for granted for a long time turns out to be a brand-new discovery for others, it leaves me a little baffled. It makes me wonder how many folks are living under a proverbial rock. Recently, I saw a post on that infamous LinkedIn from someone excited about Polars’ lazy execution.


I’ve been working lately on moving expensive distributed compute jobs (that don’t need to be distributed) from Spark to single-node tools and frameworks. To be honest, there are reasons a Data Platform might pick Spark and just keep everything on Spark, even when a workload doesn’t need it.

Yes, it costs more. Yes, it keeps things simple.

We try to balance the need for simplicity, fast development iterations, a low mental ramp to get up to speed, and reliability. It’s a difficult line to walk for some of us. Things change, we are asked to save costs … so I find myself kicking Spark to the curb in favor of Polars.

Yes, it increases the lines of code written, and yes, there is more mental burden, but at the same time, costs come down. We get to do something new in production that is fun and keeps interest running.

Large datasets, small compute, streaming needs.

One of the struggles I have today is the actual reality of using not-tiny datasets, with smallish compute, in a production setting that requires certain integrations be done well. Most of all, that means keeping memory pressure low while keeping throughput high, and integrating with the modern Lake House architecture.
