The Battlefield of the Data Engineer.

I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean, the URL does say “confessions,” does it not? For better or worse, I’ve been thinking a lot lately about what it means to be a Data Engineer, what it’s like to be a Data Engineer, and what makes a good Data Engineer. Just the life of a Data Engineer in general. The Battlefield of the Data Engineer is fought in the labyrinth of nested SQL queries. It rages in the depths of distributed computing clusters. It vies for victory on the crags and peaks of DevOps. It fights for precious ground amid the chaos of the perfect OOP and Functional code bases. Phew… and all that just to keep your head above water.

The Decision Point.

I have found the following to be true in my experience down the winding and dusty path of Data Engineering. There appears to be a decision point in the Quest for survival of the harried Data Engineer.

  • Should I slump into the easy chair of time and reminisce about my past battle stories?
  • Should I forge on into the unknown, where I will probably be eaten by a Monster, but might win glory?
The Data Engineer preparing for a day’s work.

Armchair dude.

Who are these two people? The first one is the person who has learned SQL, Data Warehousing, and probably one or two GUI ETL tools. They’ve probably designed a few Data Warehouses from scratch and know their way around some of the most complex, nested, head-spinning SQL queries ever created. I’ve been there. It’s not that hard of a place to get to. It becomes old hat after a while; you can tune any query with ease, create indexes, do third normal form, Kimball, whatever.

Anyone who’s written SQL for 80% of their day for 2+ years probably ends up in this category. As for ETL? Whatever, SSIS, Informatica, SAP Data Services, Talend. I mean, you don’t have to use very many drag-and-drop ETL tools before they all blur together into some battle haze. Input file, map file, set up connection, make small transform. Then it’s back to SQL for the rest of the day; there’s always another business user needing a report.

This dented and battle-hardened Data Engineer will always be highly valued and easily employed. The number of companies still running classic Data Warehouse environments and teams (think SQL Server, Oracle, maybe Postgres if they are fancy) is huge. It’s hard to overstate how important this function is to so many businesses; it gives them insight into the business and the ability to act on it. The job has changed little in the last 15 years, but that’s ok, it’s essential.

What about that second guy?

Put back on the helmet dude.

The second dude just isn’t satisfied with life. The explosion of “Big Data” and the flood of new tools and services opened the floodgates of adventure. Hadoop, Presto, Flink, Spark, Kafka, GCP, AWS, Redshift, Scala, Python, Kubernetes, blah, blah, and blah. The pull of the next new shiny toy is just too much to bear. They can’t sit in the armchair anymore; they have to blaze new trails.

This is the second set of Data Engineers I run into. They have left the confines of SQL and RDBMS and have morphed into polyglot software engineers. They probably spend little of their day in SQL, although they could write it with the best of them. Most of their day is spent writing code that will run on some distributed system, or at least code complex enough to solve problems that any GUI tool would choke on. Docker, DevOps, large codebases, NoSQL, file formats.

The world is their playground now. But it all probably comes with a price.

The Conundrum of the New Data Engineer.

The new Data Engineer, ready for any challenge…but which one?

This is the question I always find myself pondering on the dusty path. The new Big Data tools keep coming rapidly, and not everyone chooses the same tool. What to do, what to do? How can you keep up with the times and the new, younger warriors that never stop rising? The new and bold Data Engineer who isn’t sitting in an armchair has already made a decision. The decision was to push forward, learn new things, conquer new territory. But what territory to conquer?

Apache Spark is the behemoth of the day. It’s like the Balrog: you must conquer it, or Thou Shalt Not Pass. But Spark isn’t that easy to learn, and if you can’t do it on the job, how do you do it at all? Streams are the future: Pulsar, Kafka, Pub/Sub… you must dive in. Kubernetes, Beam, Flink, Presto, they are not going away. You will eventually run across some Hive.

Oh, and you must be an expert with Docker, Git, and all sorts of DevOps/CI/CD… all just to get the chance to write Software Engineer-level code in Python, Scala, Go, and Java.

How can you possibly learn all this?

Well… most likely you can’t without a lot of work. I would argue that working at “smaller” companies/startups is going to be one of the only ways to do this. You could be like me: start a blog, spin up clusters, install new tools, test them out. But this will only give you the 10,000-foot view. No amount of tinkering around on the side can replace real-world experience, heartbreak, and mistakes. The blood, sweat, and tears of battle just can’t be easily reproduced.

There are certain problems and lessons that will most likely only come up on the job.

The smallerish company that doesn’t have the world locked down, and that allows freedom of choice to an extent, is the key for the new Data Engineer. Play with Spark as much as you want on the side, but having to write a production Spark script is probably the only way you’ll truly understand what problems arise and how to get around them.
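
To make that concrete, here is a minimal sketch of the kind of PySpark job I mean (the bucket paths and column names below are made up for illustration): read raw files, do a small transform, write out Parquet. A toy like this is easy to write at home; it’s the production version, with skewed data, bad records, and memory pressure, that does the real teaching.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A minimal batch job sketch. Paths and columns are hypothetical.
spark = SparkSession.builder.appName("daily_orders_rollup").getOrCreate()

# Read raw CSV files dropped off by some upstream process.
orders = spark.read.option("header", "true").csv("s3://my-bucket/raw/orders/*.csv")

# Small transform: cast, drop bad records, aggregate per day.
daily_totals = (
    orders.withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount").isNotNull())
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# Write the rollup back out as Parquet for downstream consumers.
daily_totals.write.mode("overwrite").parquet("s3://my-bucket/warehouse/daily_totals/")

spark.stop()
```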

How do you get comfortable with Kubernetes? You can read something on Medium, but you won’t really get it until you’ve struggled through the YAML files for a deployment, trying to figure out which Service went wrong where because of a typo, before you even truly understand the system you just deployed. There is something about SSH’ing into a pod somewhere and figuring out some networking thingy that just brings a level of clarity.

Code

Probably one of the most important aspects of the new Data Engineer is being able to write clean, well-rounded code. But you actually have to write code to make this happen. Pushing yourself in this area is going to solve a lot of other problems. Being able to write OOP is important; you will run into it as the new Data Engineer, and you need to be comfortable with it. Understanding Scala and what functional programming is will make you question your life, but it will also make you better. The more code you write, and the greater your breadth of experience, the easier new tools and architectures will be for you.
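
If you have lived mostly in SQL, here is a tiny, contrived Python sketch of the two styles I mean (every name in it is mine, invented for illustration): the same little job written the OOP way and the functional way.

```python
from functools import reduce

# Contrived example: total up the valid order amounts, two ways.

# The OOP flavor: state and behavior bundled into a class.
class OrderBatch:
    def __init__(self, amounts):
        self.amounts = amounts

    def valid_amounts(self):
        return [a for a in self.amounts if a is not None and a > 0]

    def total(self):
        return sum(self.valid_amounts())

# The functional flavor: small pure functions, composed together.
def is_valid(amount):
    return amount is not None and amount > 0

def total(amounts):
    return reduce(lambda acc, a: acc + a, filter(is_valid, amounts), 0)

raw = [10.0, None, -5.0, 2.5]
print(OrderBatch(raw).total())  # 12.5
print(total(raw))               # 12.5
```

Neither style is the one true way; the point is being comfortable reading and writing both when you land in a codebase that prefers one over the other.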

I’ve found you just can’t sit still in this area; even languages like Python shift and change over time. Things and styles come in and out of fashion. But one thing I’ve always noticed: once you’ve pushed yourself to understand your language of choice and its best practices… moving to different teams that value different things will be easier.

The Dumps.

A Data Engineer down in the dumps.

You must avoid getting down in the dumps. There will always be someone who writes better code than you. There will always be people smarter than you. You will make mistakes. You will always look back at code you’ve written and wonder why they hired you. There will always be bugs that take hours and days to solve, all while you question your very worth as a developer. There will always be smart jerks. Just remember, these things are good for you. And they show you who you do and don’t want to be.

It’s mostly about perspective and attitude. Do you want to get better? Do you want to learn that new technology? Do you enjoy what you do? These are what matter in the long run. Don’t forget to get a hobby and leave the computer alone sometimes. Read a book, go for a walk. Remember why you do what you do: you write code and solve problems because it’s fun and exciting. There is nothing like the satisfaction of solving a problem and watching your code pump out the answer.

Keep pushing yourself; don’t settle down into the armchair. If you aren’t making yourself uncomfortable and getting stuck on hard problems, then push harder. Everyone breathes air just like you; if they can do it, so can you.

Conclusion

What kind of Data Engineer am I? I wonder that myself lately. I care more about being pragmatic than perfect. I love what I do, and new tools are exciting. I have lots to learn; I am a master of none but capable of most of it. I always want to be learning something new. If I’m not uncomfortable and haven’t broken one thing at least once a week, then I know I’m ever so slowly sinking back into that armchair.

I don’t take life or my code that seriously.

I was supposed to publish “Intro to Apache Pulsar for Data Engineers” tonight, but my Pulsar cluster still isn’t working yet… 🙂