Thoughts on Distributed Data Pipelines – Spark vs Kubernetes

Data Pipelines – Spark vs Kubernetes, or both?

Data gets bigger and teams want to process it faster, so what else can you do? There is only so much code tweaking you can do; threads, processes, and asyncio will only get you so far. At some point you have terabytes of data to process, and that forces a decision about some sort of distributed processing system.

In my experience I’ve mostly used two different distributed data processing systems in production: Spark and Kubernetes. To be honest, the choice of one over the other has always been obvious; the data usually dictates which system you pick. I’m sure there are super fans of each who would argue there’s always a way to do any transform or process on either, but often the real question is which system is set up to easily and quickly move the data from one point to another and transform it as needed.

Which to Choose?

Let’s talk about Spark and Kubernetes. On the surface, most people would say Kubernetes isn’t really a distributed data processing system; it’s a container orchestration system. I would say that sounds like a great way to have a bunch of containers processing data for me. There are probably plenty of people running Spark on Kubernetes, for example. But there are a lot of specialty cases, like processing satellite data, that don’t fit into Spark when you need something like GDAL.

Spark. I’ve written a few posts, here and here, on it in the past, and I’ve come to love it. It’s a must-have for any data engineer. But I’ve noticed it thrives in the normal structured and semi-structured data space: JSON, CSV, and other DataFrame-like data. Things become a little more difficult when you start needing geospatial tools and processing. There are a few tools like geotrellis, but they focus on Scala and aren’t as extensive and tried-and-true as GDAL.
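For that structured and semi-structured sweet spot, a few lines of PySpark usually cover the whole read-transform-write loop. Here’s a minimal sketch; the bucket paths and column names are placeholders I made up, not anything specific.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-rollup").getOrCreate()

# Read semi-structured data straight into a DataFrame (hypothetical path).
orders = spark.read.json("s3a://my-bucket/raw/orders/*.json")

# A typical transform: filter, aggregate, write back out as Parquet.
daily_totals = (
    orders
    .filter(F.col("status") == "complete")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet("s3a://my-bucket/curated/daily_totals/")
```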

This is where Kubernetes comes in. Container orchestration is a beautiful thing. Of course you will need to learn Docker, but that takes all of about 5 minutes. When you really think about it, most data pipelines consist of doing the same thing many times over: downloading files, doing some transform, putting the data somewhere for storage. Of course Kubernetes lends itself to this type of distribution.
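That repeated unit of work, download, transform, store, is exactly what ends up inside the container. Here’s a rough sketch of what such a worker script might look like for the geospatial case, shelling out to GDAL and assuming the files live in Google Cloud Storage; the bucket names, arguments, and gdal_translate options are all assumptions for illustration.

```python
import subprocess
import sys
from pathlib import Path

from google.cloud import storage  # assuming files live in GCS


def process_file(in_bucket: str, blob_name: str, out_bucket: str) -> None:
    """Download one image, convert it with GDAL, upload the result."""
    client = storage.Client()
    local_in = Path("/tmp") / Path(blob_name).name
    local_out = local_in.with_suffix(".tif")

    # Download the raw image.
    client.bucket(in_bucket).blob(blob_name).download_to_filename(str(local_in))

    # Shell out to GDAL for the heavy lifting (gdal_translate has to be in the image).
    subprocess.run(
        ["gdal_translate", "-of", "GTiff", str(local_in), str(local_out)],
        check=True,
    )

    # Put the processed file somewhere for storage.
    client.bucket(out_bucket).blob(local_out.name).upload_from_filename(str(local_out))


if __name__ == "__main__":
    # Each Pod is told which file to work on via its args.
    process_file(sys.argv[1], sys.argv[2], sys.argv[3])
```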

Basically, Kubernetes is a cluster of node pools waiting to run as many containers, or Pods, of work as you want to throw at it. So let’s say you have a bunch of specialty geospatial work inside a data pipeline: download some large images, process them with GDAL, and save them off. All you really need to do is write a Docker container that does this for one or more files. Install everything you need into the Docker image: your tools, your code, etc. Then it’s a matter of spending most of your time writing some simple code to throw a bunch of Jobs at Kubernetes that turn into Pods, as in the sketch below.
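Here’s roughly what that “simple code” can look like with the official kubernetes Python client: one Job per file, each Job spinning up a Pod from your image. The image name, file list, and bucket arguments are made-up placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
batch = client.BatchV1Api()

files = ["scene_001.jp2", "scene_002.jp2"]  # hypothetical list of images to process

for i, f in enumerate(files):
    container = client.V1Container(
        name="gdal-worker",
        image="gcr.io/my-project/gdal-worker:latest",  # your image with GDAL + code baked in
        args=["raw-imagery-bucket", f, "processed-imagery-bucket"],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"gdal-job-{i}"),
        spec=client.V1JobSpec(
            backoff_limit=2,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            ),
        ),
    )
    batch.create_namespaced_job(namespace="default", body=job)
```

From there Kubernetes schedules the Pods across your node pools and you just watch the Jobs complete.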

Pick The Obvious Choice.

If you’re new to Kubernetes and want a managed system, check out GKE, Google’s managed Kubernetes engine. It’s amazing and easy to use, and comes with a sweet Python library for interaction. When it comes to data pipelines, just try to think past all the soundbites. Focus on the problem you are trying to solve, break it down to its simplest form, and pick the obvious tool. Are you working with structured or semi-structured data? Pick Spark. Are you working on a problem that requires specialized tooling and a specific setup? Pick Kubernetes and spend the extra time it takes to build your own pipeline.
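And if you want a quick taste of that Python interaction with GKE, here’s a tiny sketch using the google-cloud-container package (one option; the standard kubernetes client shown earlier works too once you have credentials). The project and location are placeholders.

```python
from google.cloud import container_v1

gke = container_v1.ClusterManagerClient()

# List the GKE clusters in a project/location (placeholder values).
request = container_v1.ListClustersRequest(
    parent="projects/my-project/locations/us-central1-a"
)
for cluster in gke.list_clusters(request=request).clusters:
    print(cluster.name, cluster.status)
```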