SQL Bad, Reddit Mad

SparkSQL is Destroying your Pipelines

It’s true, even if you don’t want it to be. SparkSQL is destroying your data pipelines and possibly wreaking havoc on your entire data team, infrastructure, and life. In your heart of hearts, you’ve probably known it for years. With great power comes great responsibility, and we all know that even we Data Engineers are human and fallible.

Once those tentacles of SparkSQL get their hold on you, the probability of survival is low. Sure, there are a few wizened old engineers with enough battle scars to make it through unscathed. The rest of us will be maimed.


Datafusion SQL CLI – Look Ma, I made a new ETL tool.

Sometimes I just need something new and interesting to work on, to keep me engaged. A few days ago I was lying by the river next to a fire, with the cold air blowing on my face and the eagles soaring above. As I contemplated life and data engineering, something flitted across my mind, just a little fragment of an idea someone had written about.

The little fragment had to do with Datafusion, a Rust-based query engine, and something about it having a SQL CLI.

What an interesting thing. I’ve used Datafusion a few times, here and there, and I love Rust because it’s fast. I’m a Data Engineer, so I’m eternally enslaved to SQL whether I like it or not. This whole thing just seemed like an interesting little tidbit to poke at.

It basically made me wonder if I could combine the Datafusion SQL CLI with bash into a new ETL tool. Simple, small, fast, and maybe fun? Just because I can?
