Some poor Data Engineer is sweating and typing away in a dark closet … moving data, solving bugs, just trying to get through the day. Why should the ‘ole Data Engineer care about the huff-a-luff around the billion-dollar funding series recently closed by Databricks? I mean, what possible relevance could it have to the day-to-day life of a Data Engineer, and why should they care at all? You ever heard that the proverbial light at the end of the tunnel is actually a train steaming your way, ready to pulverize you? That’s why.


With the rise of data engineering and big data, I’ve always been surprised how hard it is to find good data engineering content that is published somewhat regularly. Tech moves fast and I feel like data engineering moves even faster. New tools and systems come out all the time, and it’s hard to keep up with what’s hot and what’s not. But I still think it’s important to keep a finger on the pulse of which tech stacks are starting to take over (Spark) and which are fading into oblivion. So here is my top ten list of data engineering blogs. These are the places I frequent so I at least know what’s going on in the world of data engineering.


I’ve always been surprised by how little of the Python code I’ve seen uses the built-in map() and filter() functions. I’ve always found them useful and easy to use, but I don’t often come across them in the wild. I’ve even been asked to remove them from my MRs/PRs, for no other reason than that they are supposedly ambiguous to some people? That’s got me thinking a lot about map() and filter() as related to readability, functional programming, side effects and other never-ending debates where no one can even agree on the “correct” definition. Seriously. But I will leave that rant for another time.
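For anyone who hasn’t bumped into them, here is a minimal sketch of the two styles people argue about. The sample data and the `is_integer_string` helper are made up for illustration; the only real pieces are Python’s built-in map() and filter().

```python
# Hypothetical sample data, invented for this illustration.
raw_values = ["3", "10", "", "7", "not-a-number", "42"]


def is_integer_string(value: str) -> bool:
    """Return True when the string is made up only of digits."""
    return value.isdigit()


# Built-in function style: filter() keeps the items that pass the test,
# map() transforms each survivor. Both return lazy iterators in Python 3,
# so list() is needed to materialize the result.
with_builtins = list(map(int, filter(is_integer_string, raw_values)))

# The equivalent list comprehension, which many reviewers prefer.
with_comprehension = [int(v) for v in raw_values if is_integer_string(v)]

print(with_builtins)
print(with_comprehension)
```

Both lines produce the same list, so the disagreement is about readability rather than behavior.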

Trying to learn Scala drives me crazy.

I seriously don’t know why I keep doing this to myself. I know learning new things is something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straightforward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in Python, so all the snotty people who complain about Python not being fast enough or whatever can go hang out with this cow, looks like he could use a friend. This is something I’ve been meaning to do for a while: use Scala to read some text file(s), and store the data somewhere with some client. I chose Elasticsearch. I really just wanted practice doing something simple like reading files, and I was curious about how good the Scala clients are for popular tools.


I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean, the URL does say “confessions”, does it not? For better or worse I’ve been thinking a lot lately about what it means to be a Data Engineer, what it’s like to be a Data Engineer, and what makes a good Data Engineer. Just the life of a Data Engineer in general. The Battlefield of the Data Engineer is fought in the labyrinth of nested SQL queries. It rages to the depths of distributed computing clusters. It vies for victory on the crags and peaks of DevOps. It attacks for precious ground amid the chaos of the perfect OOP and Functional code bases. Phew… and all that just to keep your head above water.

Comparing the PyPI requests vs httpx packages, who will fall on their face?

Someone recently brought up the new kid on the block, the httpx Python package, for HTTP work of course. I mean, the PyPI package requests has been the de facto standard forever. Can it really be overthrown? Is this a classic case of “oh how the mighty have fallen”? I want to explore what the new httpx package has to offer, but mostly just … which one is faster. That is what data engineers really care about.
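As a rough idea of how a “which one is faster” comparison could be set up, here is a minimal timing sketch. It is not the benchmark from the post. The `time_calls` helper and the `fake_fetch` stand-in are hypothetical names invented here, and the stand-in only sleeps so the example runs without a network.

```python
import statistics
import time
from typing import Callable, Dict


def time_calls(fetch: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Call fetch() `repeats` times and summarize wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def fake_fetch() -> None:
    """Hypothetical stand-in for an HTTP call so the sketch runs offline."""
    time.sleep(0.01)


print(time_calls(fake_fetch, repeats=3))
```

To compare the real libraries, one could pass `lambda: requests.get(url)` and `lambda: httpx.get(url)` in place of `fake_fetch`, with `url` pointing at an endpoint you control. Expect network noise to dominate a small number of runs, so treat any single result with suspicion.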

Why can’t GCP come up with their own Boto3?

First, let’s set the record straight. GCP is better than AWS. This will be clear to anyone who has used both services for a reasonable amount of time. GCP was built with the developer in mind; the services and tools offered work better, are cleaner, and way simpler. But there is one thing that is totally annoying. Where is GCP’s answer to AWS’s Python Boto3 library? I mean seriously. Boto3 is the one-stop shop to plug in and interact with pretty much every AWS service available, and the documentation is reasonable. Seriously GCP, where you at?

Software should be a craft first, then an engineering problem second.

Craft first, engineering second.

There are probably a lot of software programmers, developers, and engineers who will take issue with this. That’s kinda the point. Software should be approached as a craft first, then an engineering problem second. There are so many ways this is true, it’s going to be hard to touch them all. I might be biased but who cares, besides, you are more than welcome to be wrong! I have strong feelings about software as a craft because it affects every aspect of how we write code. The approach you choose will ooze from your code, relationships, teams, interactions, and your career. A software project reflects the people and ideals that built it.

The curse of the software that never works.

You ever wonder how a room full of what appear to be smart engineers manages to build software that doesn’t work? Given more time and money, it only seems to get worse, or at least no better. It doesn’t make that much sense, does it? As someone who writes software, I find it hard to see how bugs that bring whole systems down never seem to get fixed. Or how 5 bugs get fixed but 10 more appear in their place.
