Writing Apache Spark with Rust! Spark Connect Introduced.

I never thought I would live to see the day, it’s crazy. I’m not sure who’s idea it was to make it possible to write Apache Spark with Rust, Golang, or Python … but they are all genius.

As of Apache Spark 3.4 it is now possible to use Spark Connect … a thin API client on a Spark Cluster ontop of the DataFrame API.

You can now connect backend systems and code, using Rust or Golang etc, to a Spark Server and run commands and get results remotely. Simply amazing.  A new era of tools and products is going to be unleashed on us.  We are no longer chained to the JVM. The walls have been broken down. The future is bright.

DuckDB + Delta Lake (the new lake house?)

I always leave it to my dear readers and followers to give me pokes in the right direction. Nothing like the teaming masses to set you straight. Recently I was working on my Substack Newsletter, on the topic of Polars + Delta Lake, reading remove files from s3 … I left a question open on my LinkedIn account.

I had someone jog my leaky memory in favor of DuckDB. I haven’t touched DuckDB in some time, and I’m sure it’s under heavy development what with that Mother Duck and all.

So, it’s time to talk about DuckDB + Delta Lake.

Read more

Ballista (Rust) vs Apache Spark. A Tale of Woe.

Sometimes it seems like the Data Engineering landscape is starting to shoot off into infinity. With the rise of Rust, new tools like DuckDB, Polars, and whatever else, things do seem to shifting at a fundamental level. It seems like there is someone at the base of a titering rock with a crowbar, picking and prying away, determined to spill tools like Java, Scala, Python, Spark, and Airflow, the things we’ve known and loved for years, from their lofty thrones.

Maybe they all have had their time in the Data Engineering sun, maybe it’s time to shake things up. It seems to be happening. It’s always hard to have those we hold dear be poked and prodded at. I’ve been using Spark since before it was cool, so when I started to hear the word Ballista start to show up here and there, I took note.

Besides, I’ve been dabbling my grubby little fingers in Rust for some months now, and have seen The Light. Is it possible I could be living at the dawn of a new era? A new and exciting frontier of Data Engineering, finally, after all this time? Could Rust really take over? Will something like Ballista pull that old Spark from its distributed processing tower and claim its rightful place?

Read more

Exploring Graphs in Rust. Yikes.

I’ve been a dog licking my wounds for some time now. Over on my Substack newsletter, I’ve been doing a small series on DSA (Data Structures and Algorithms). I tackled some of the easier stuff first, like Linked Lists, Binary Search, and the like. What’s more, I actually did most of it in Rust, since I’ve possibly, maybe slightly, every so slightly, fallen in love with Rust.

Like most relationships, it vacillates between pure adoration and utter hatred, depending on the problem at hand. When I did a recent article on Graphs, Queues, and BSF, I attempted it in Rust, and was struck a mighty blow, that borrow checker had me down. It seemed doable, but at the time, under time pressure to get the Newsletter out, I reverted to Python and moved on.

Alas, I’m back again, a glutton for punishment. This time I thought I should try another crack at parsing a graph with Rust, but in a real-life situation, no more made-up stuff.  Actual data, actual graph, here we go. All code is on GitHub.

Read more

Conceptual Introduction to Delta Lake.

Old Dog Learn New Tricks? Starburst (Trino) Galaxy and other thoughts.

Sometimes I think Data Engineering is the same as it was 10+ years ago when I started doing it, and sometimes I think everything has changed. It’s probably both. In some ways, the underlying concepts have not moved an inch, some certain truths and axioms still rule over us all like some distant landlord, requiring us to pay the piper at a moment’s notice. Still, with all those things that haven’t changed, the size, velocity, and types of data have exploded. Data sources have run wild, multiple cloud providers, and a plethora of tooling. 

So yes, maybe in a lot of ways Data Engineering has changed, or at least how we do something is a new and wild frontier, with beasts around every corner waiting to devour us in our ignorance. Never mind the wild groups of zealots roaming around seeking converts to their cause and spitting on those unwilling to bend.

Probably like many of you, I’ve had a healthy skepticism of all things new, at least until they have proved themselves out over some time. This is both a good and a bad habit. It can protect you from undo harm and foolishness, but can also be lost opportunity when you pass over the diamond in the rough. I for one, think that if something is worth its weight in salt, it is usually clear, and its obvious value can be discerned quite readily.

Read more

Real Talk about Running Databricks + Delta Lake at Scale.

Anyone who’s been working in Data Land for any time at all, knows that the reality of life very rarely matches the glut of shiny snake oil we get sold on a daily basis. That’s just part of life. Every new tool, every single thingy-ma-bob we think is going to solve all our problems and send us happily into the state of nirvana inside our eternal data pipelines, is a lesson in disappointment.

I get it, there are a lot of nice tools out there. I use some of them every day. But, a healthy dose of reality is good for us all. Don’t lie to yourself. There is no such thing as the perfect tool. There are good tools, bad tools, and tools in between. The Truth is that all tools get pushed to their limits at some point.

We work on small teams, we don’t have all the time in the world, and we have to deliver our data at some point, perfect or not. We cut corners, hopefully, the right ones. That’s part of being wise and putting years of data experience to work. Today I’m going to talk about my experience of running Databricks + Delta Lake at scale. What happens when you use Databricks to ingest and deal with 10’s of millions of records a day, billions+ records a month?

Read more

The Dog Days of PySpark

 

PySpark. One of those things to hate and love, well … kinda hard not to love. PySpark is the abstraction that lets a bazillion Data Engineers forget about that blight Scala and cuddle their wonderfully soft and ever-kind Python code, while choking down gobs of data like some Harkonnen glutton.

But, that comes with a price. The price of our own laziness and that idea that all that glitters is gold, to take the easy path. One of the main problems is the dreadful mistake of mixing native Python in with your PySpark and expecting things to go fine at scale. Which it most assuredly will not.

Read more

What is a Data Mesh?

Data Types in Delta Lake + Spark. Join and Storage Performance.

Hmm … data types. We all know they are important, but we don’t take them very seriously. I mean we know the difference between boolean, string, and integers, those are easy to get right. But we all get sloppy, sometimes we got the string and varchar route because we don’t spend enough time on the data model to care.

Can a string versus a int or bigint in Delta Lake with Spark have a big impact on performance? Data size? Does it matter? Let’s find out.

Read more