Concurrently Download Large Files from GCS.

Waiting for large files to download is boring.

There is nothing more annoying than sitting around waiting for files to download. That was true while I was in high school staring at LimeWire, and it’s still true today, especially when you’re a data engineer who’s supposed to make data pipelines fast. You’re in luck! Yes, it is possible to download a large file from Google Cloud Storage (GCS) concurrently in Python. It took a little digging in Google’s terrible documentation for their Python cloud storage wrapper (hear my snarkiness), but I found a diamond in the rough.
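The general idea can be sketched without touching GCS at all: split the file into byte ranges, fetch each range on its own thread, and stitch the pieces back together in order. This is only a minimal sketch, not necessarily the approach the full post lands on. The names `byte_ranges`, `download_concurrently`, and `fetch_range` are hypothetical, and `fetch_range(start, end)` is assumed to return the bytes for an inclusive range.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def byte_ranges(size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a file of `size` bytes into inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk_size, size) - 1)
        for start in range(0, size, chunk_size)
    ]


def download_concurrently(
    fetch_range: Callable[[int, int], bytes],  # hypothetical range fetcher
    size: int,
    chunk_size: int = 8 * 1024 * 1024,
    max_workers: int = 8,
) -> bytes:
    """Fetch every byte range on a thread pool and join the chunks in order."""
    ranges = byte_ranges(size, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the chunks
        # line up correctly even if they finish out of order.
        chunks = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(chunks)
```

With the `google-cloud-storage` package, `fetch_range` could plausibly be a thin wrapper around `blob.download_as_bytes(start=start, end=end)`, with `size` taken from `blob.size` after fetching the blob's metadata. Treat that wiring as an assumption and check it against the library version you have installed. Note that this sketch holds the whole file in memory, so truly huge files would want each chunk written to disk instead.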

Read more

Data Engineering vs Data Science – Where’s The Love??

The Tech Fight of the Century, Data Engineering vs Data Science.

It seems like a never-ending battle for supremacy. Articles about Data Science being the bee’s knees, then more articles about how Data Engineering holds up the world of Data Science like Atlas. Whenever I read something in one of these two categories on Medium or wherever, it just seems more like an ego clash to me. It’s human nature to want to be the best, to be better, to feel like you are the person who really makes it all happen.

Read more

Thoughts on Distributed Data Pipelines – Spark vs Kubernetes

Data Pipelines – Spark vs Kubernetes, or both?

Data gets bigger and teams want to process it faster, so what else can you do? There is only so much code tweaking you can do; threads, processes, and asyncio are only going to get you so far. At some point you have terabytes of data to process, and that requires a decision about some sort of distributed processing system.

In my experience, I’ve mostly used two different distributed data processing systems in production: Spark and Kubernetes. To be honest, it has always been obvious when to choose one over the other. The data usually dictates which system you choose. I’m sure there are super fans of each system who would argue there’s always a way to do any transform or process on each, but sometimes the point is which system is set up to easily and quickly move the data from one point to another and transform it as needed.

Read more

You Have to Try This… from io import StringIO, BytesIO

StringIO and BytesIO are perfect for making your Python faster.

Ever heard of something called a File Object in Python? Ever heard of BytesIO or StringIO? You’re missing out. They’re easy, fast, and wonderful; in short, they’re the best. For some reason, IO streams are a totally underused feature that rarely comes up in most code. We all know that memory is faster than disk IO, and this is what I use IO streams for.
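As a quick taste, here is a minimal sketch of both classes standing in for files on disk. The helper names `rows_to_csv_text` and `read_header` are hypothetical, made up for this illustration; the point is only that anything expecting a file object will happily take an in-memory buffer instead.

```python
import csv
from io import BytesIO, StringIO


def rows_to_csv_text(rows):
    """Write rows as CSV into an in-memory text buffer, no temp file needed."""
    buffer = StringIO()
    # csv.writer accepts any object with a write() method, including StringIO.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def read_header(raw: bytes, length: int) -> bytes:
    """Treat raw bytes as a binary file and read the first `length` bytes."""
    stream = BytesIO(raw)
    return stream.read(length)
```

For example, `rows_to_csv_text([["id", "name"], [1, "a"]])` builds the CSV text entirely in memory, and the same pattern works for handing downloaded bytes to any library that wants a file handle.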

Read more