Polars is one of those tools that you just want … no … NEED a reason to use it. It’s gotten so bad, I’ve started to use it in my Rust code on the side, Polars that is. I mean you have a problem if you could use Polars Python, and you find yourself using Polars Rust. Glutton for punishment I guess.

I also recently took personal offense when someone at a birthday party told me that everyone uses Pandas, and no one uses Polars in the real world. Dang. That hurt.

The reality is that I know it takes a long while for even the best technologies to be adopted. Things don’t just change overnight. But there are two hidden gems of Polars that will hasten the day when Polars replaced Pandas for good. Let’s talk about them.

Read more

I was wondering the other day … since Polars now has a SQL context and is getting more popular by the day, do I need DuckDB anymore? These two tools are hot. Very hot. I haven’t seen this since Databricks and Snowflake first came out and started throwing mud at each other.

You might think it doesn’t matter. Two of one, half-dozen of another, whatever. But I think about these things. Simplicity is underrated these days. If you have two tools but could do it with one, should you use two? Probably depends on the Engineering culture you’re working in.

I mean just because you can doesn’t mean you should. Some data engineering repo with 50 different Python pip packages installed, constantly breaking and upgrading for no reason. CI/CD build failing, conflicts. Frustration. Why? Just because someone wants to do this one thing and decided they needed yet another package to do it.

Read more

 

PySpark. One of those things to hate and love, well … kinda hard not to love. PySpark is the abstraction that lets a bazillion Data Engineers forget about that blight Scala and cuddle their wonderfully soft and ever-kind Python code, while choking down gobs of data like some Harkonnen glutton.

But, that comes with a price. The price of our own laziness and that idea that all that glitters is gold, to take the easy path. One of the main problems is the dreadful mistake of mixing native Python in with your PySpark and expecting things to go fine at scale. Which it most assuredly will not.

Read more

Save money, save money!! Hear Hear! Someone on Linkedin recently brought up the point that companies could save gobs of money by swapping out AWS Python lambdas for Rust ones. While it raised the ire of many a Python Data Engineer, I thought it sounded like a great idea. At least it’s an excuse to play with Rust, and I will take all those I can get. It does seem like an easy and obvious step to take in this age of cost-cutting that has come down on us all like that thick blanket of fog on a cool spring morning.

I can personally attest to the fact that I’ve written a number of Python AWS lambdas that are doing a non-trivial amount of data processing, currently running in Production and being triggered many times a day. Today, I’m going to reproduce both a Python and Rust lambda running on my personal AWS account doing pretty much the same exact work. Let’s see what the difference actually is in performance and see if it’s possible to find some cost savings.

Read more

Photo by Stone Wang on Unsplash

I remember those days, oh so long ago, it seems like another lifetime. I haven’t used Pandas in many a year, decades, or whatever. We’ve all been there, done that. Pandas I mean. I would dare say it’s a rite of passage for most data folk. For those using Python, it’s probably one of the first packages you use other than say … requests?

You know, Pandas feels like Airflow, everyone keeps talking about its demise, but there it is everywhere … used by everyone. Sure it’s old, wrinkled, annoying, slow, and obtuse, but it’s ours, and that makes it the words of Gollum … precious.

We should probably get to the point already. Everyone is talking about Polars. Polars is supposed to replace Pandas. Will it? Maybe 10 years from now. You can’t untangle Pandas from everywhere it exists overnight. Do you still want to replace Pandas with Polars and be one of the cool kids? Ok. Let’s take a look at a practical guide to replacing Pandas with Polars, comparing functionally used by most people. My code is available on GitHub.

Read more

We’ve all been in that spot, especially in tech. You wanted to fit in, be cool, and look smart, so you didn’t ask any questions. And now it’s too late. You’re stuck. Now you simply can’t ask … you’re too afraid. I get it. Apache Arrow is probably one of those things. It keeps popping up here and there and everywhere.

The only reason I know anything about Arrow is that some years ago, circa 2019 and earlier I stumbled into Arrow and used it to read and write Parquet files (pyarrow that is). Heck, I even used it to tie together Python and Hadoop, Lord knows what I was thinking back then. I’m amazed at how much I used PyArrow back in the day, even to compare Parquet vs Avro.

“Back then it seems like no one used Arrow much, no one was writing about it, using it, or talking about it. At least not that I saw. But oh how times have changed. Arrow seems to be showing up everywhere and is starting to become a backbone for many other tools.”

– me
Read more

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My My, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, doesn’t it?

Would anyone like a nice big slice of groupBy, maybe agg is what you need? No? Can you say distributed data set? Whatever it is you’re looking for, I’m quite sure a nice old DataFrame can give it to you. With so many options to choose from … what do you choose? I don’t know, whatever works best for you. But, it does set the stage nicely for a clash of the titans per see.

Let’s do this just that. Straight out of the box performance test. Bunch of CSV’s, a little aggregation, just some simple stuff. Mirror mirror on the wall, who is the fastest with DataFrames of them all?

Read more
Photo by Aziz Acharki on Unsplash

Do this, not that. Well, I’ve got my own list. With everyone jumping on the PySpark / Databricks / EMR / Glue / Whatever bandwagon I thought it was long overdue for a post on what to do, and not to do when working with Spark / PySpark. I take the pragmatic approach to working with Spark, it’s honestly very forgiving well and far into the 10s of TBs of data. Once you wander past that point things tend to get a little spicy if you don’t have it all dialed in. As with most things in life if you get a few things right, and of course don’t do some things, that will get you a long way, the same applies to Spark.

Read more
Photo by Tim Schmidbauer on Unsplash

Ever since playing with Great Expectations with Spark some time ago, I’ve been on the lookout for more Data Quality at-scale tools. The market still has a long way to go with these tools, not enough options, hard to use, and the typical Data Engineering travails. I came across soda-core recently, a self-proclaimed…

Data reliability testing for SQL- and Spark- accesssible data.

soda-core docs

Doing anything at scale, well … that’s usually the problem. Data Quality and Observability are topics were hear a lot about these days. The reality often doesn’t meet the expectations most of the time. Even Great Expectations, being awesome, can get complicated real quick-like. Let’s hope that soda-core pair with Spark can show us some real promise. Code available on GitHub.

Read more
Photo by José Ramos on Unsplash

I’ve always been a firm believer in using the right tool for the job. Sometimes I look at a piece of code … and ask … why? I mean just because you can do something doesn’t mean that you should. I see a lot of my job as someone who writes code … as not just my ability to write code, but the ability to reason about problems and design simple and elegant solutions that solve the problem at hand.

I try not to let my love of a tool, language, or package color my view of the world as it is. In fact, there is wisdom to be found in being critical of those languages and tools you love the most. Be aware of their shortcomings and failures. This leads to better software and architecture designs, and less complexity. Too often I’ve seen folks picking their tool of choice and then sticking with it till the bitter end, and it usually is bitter. There is more to life than writing obtuse Scala code that is illegible for some mundane task.

This sort of thing is a blight on everyone and every system. Now I must descend from my high horse and join the peasants on the dusty road of life. Today I want to look at some very common Data Engineering tasks, namely cloud storage, and what it is like to do such a thing with Golang, Rust, and Python. I will let you draw your own conclusions. Maybe. Code available on GitHub.

Read more