Ok. Get off your high horse. You are human just like the rest of us. Just like your ancient ancestors who were throwing rocks and sticks at each other a thousand years ago … you are looking for a leg up on the competition. Isn’t that the world we live in? At the end of the day, no one is looking out for you, besides you.

Do you know what all those famous “influencers” have over you? Those you look at with boiling bitterness buried somewhere inside you. Besides a smattering of luck here and there, there is one other trait they have that most people don’t.

There are branches and streams of action that come off that main trunk, but without that central drive, all else is useless.

What is it that the 1% of the data population apparently has that the other 99% don’t? Simply put, in the words of your grandparents: hard work.

Read more

Photo by Jason Leung on Unsplash

Data engineering is a vital field within the realm of data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that data.

Read more

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My, my, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, doesn’t it?

Would anyone like a nice big slice of groupBy? Maybe agg is what you need? No? Can you say distributed dataset? Whatever it is you’re looking for, I’m quite sure a nice old DataFrame can give it to you. With so many options to choose from … what do you choose? I don’t know, whatever works best for you. But it does set the stage nicely for a clash of the titans, per se.

Let’s do just that. A straight out-of-the-box performance test. A bunch of CSVs, a little aggregation, just some simple stuff. Mirror, mirror on the wall, who is the fastest with DataFrames of them all?
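To make that concrete, here is roughly the shape of the test I have in mind, sketched in Python. The folder of CSVs and the customer_id and amount columns are made-up placeholders, not the actual dataset from the test; the point is just the comparison: the same groupBy and agg, once in Pandas and once in PySpark.

import glob
import time

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

csv_glob = "data/*.csv"  # hypothetical folder of CSV files

# Pandas: read every CSV into one in-memory DataFrame, then aggregate.
start = time.time()
pdf = pd.concat(pd.read_csv(f) for f in glob.glob(csv_glob))
pandas_result = pdf.groupby("customer_id").agg(total=("amount", "sum"))
print(f"Pandas took {time.time() - start:.2f} seconds")

# PySpark: the same logic, but distributed and lazily evaluated.
spark = SparkSession.builder.appName("csv-agg-test").getOrCreate()
start = time.time()
sdf = spark.read.csv(csv_glob, header=True, inferSchema=True)
spark_result = sdf.groupBy("customer_id").agg(F.sum("amount").alias("total"))
spark_result.write.mode("overwrite").parquet("out/agg")  # force Spark to actually do the work
print(f"PySpark took {time.time() - start:.2f} seconds")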

Read more
Photo by Ray Hennessy on Unsplash

I’ve often wondered what purgatory would be like, doing penance for millennia into eternity. It would probably be doing data migrations. I suppose they are not all that dissimilar from normal software migrations, but there are a few things that make data migrations a little more horrible and soul-sucking. Data migrations can slow teams down to a crawl, take at least twice as long as planned, and be way more difficult than imagined.

Can’t it be made easy? Shouldn’t Data Migrations have been conquered by now? I mean, just put together the perfect plan, break up the work, make a bunch of tickets, estimate it all, and the rest falls into place? If only.

Read more
Photo by Francesco Alberti on Unsplash

Nothing captures the imagination and heart like a tale of betrayal and heartbreak, and that is a tale I want to bring to you today. It’s a tale of Databricks Workflows and Jobs, version changes, new features, APIs, and insidious little hidden gems that will make you pull your hair out when you find them. It’s a tale of what not to do, a tale of how to put developer and customer experience first, instead of forcing unwanted solutions down the throats of the little birdies feeding at your nest.

As a Data Engineer, simplicity and ease of use are close to my heart, something that Databricks did well, or maybe I should say used to do well … before recent releases like the Jobs 2.1 API. I hope you can hear the bitterness oozing from my words.
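For a bit of context, here is a minimal sketch of what talking to the Jobs 2.1 API looks like: creating a simple one-task job with Python and requests. The workspace URL, token, cluster id, job name, and notebook path below are all placeholders, not values from any real workspace.

import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder PAT

# Jobs 2.1 wraps everything in a "tasks" array, even a single-task job.
payload = {
    "name": "nightly-etl",  # hypothetical job name
    "tasks": [
        {
            "task_key": "main",
            "existing_cluster_id": "<cluster-id>",  # placeholder cluster
            "notebook_task": {"notebook_path": "/Repos/etl/nightly"},  # hypothetical path
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id on success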

Read more

Ok, so I don’t really mean all that. Or do I? I have no idea what the future holds. Sometimes it’s easy to pick out the winners, like Databricks and Snowflake; you can see, feel, and taste the results of those data products, a delicious and delectable bounty to feast upon. Other things are harder to read the tea leaves on. Kinda like Data Mesh … is it a thing, or is it not a thing? It’s hard to discern between real life and the charlatans and marketing/sales departments hawking the next Cure-All Snake Oil.

What about all this recent hubbub and buzz around Data Contracts, pushed by some popular Data Engineering faces like Ananth Packkildurai and Chad Sanderson? What is all the hype about Data Contracts? Are folks just pushing another tool down our throats? Is there a real issue and problem that can be solved with Data Contracts?
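If the term feels fuzzy, here is one hypothetical sketch of the core idea: the producer and the consumer agree on a schema up front, and records that break the agreement get caught at the boundary instead of silently landing downstream. I’m using pydantic purely as an illustration, and the feed and field names are made up.

from datetime import datetime

from pydantic import BaseModel, ValidationError

# A "contract" for a hypothetical orders feed: field names and types are made up.
class OrderEvent(BaseModel):
    order_id: str
    customer_id: str
    amount: float
    created_at: datetime

raw_record = {
    "order_id": "123",
    "customer_id": "abc",
    "amount": "19.99",  # string on the wire, coerced to float by the contract
    "created_at": "2023-01-01T00:00:00",
}

try:
    event = OrderEvent(**raw_record)  # validates the record against the agreed schema
except ValidationError as err:
    # In a real pipeline this is where you would quarantine the record and alert the producer.
    print(err)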

Read more
Photo by Mitchel Boot on Unsplash

Sometimes I feel like I’ve been doing this too long, life gets busy, and I don’t have much to say … but here I am 5 years later. I’m still making people mad and making a fool of myself; some things never change. This will probably be short and sweet. I will cover the top 10 most popular blog posts from those 5 years, what the traffic has looked like over time, and what I’ve learned from writing blogs for so long: the good, the bad, and the ugly.

Read more
Photo by Kevin Ku on Unsplash

I’ve been thinking more about the topic of ML and MLOps lately. To me, it seems like the buzz around ML and MLOps has quieted down over the last few years, at least somewhat, in favor of other topics like Data Quality, Data Lakes, Data Contracts, and the like. I’ve been wondering why this is the case and comparing it with my experience over the last few years of working in, on, and around ML pipelines and systems. I’ve seen ML done at companies with a few thousand employees and at companies with a handful of employees. The problems and hurdles are the same across the board, and most everyone is not very good at it.

Read more
Photo by Alain Pham on Unsplash

Best practices are always a touchy subject; I’m going to forget someone’s pet best practice, I can already feel it. I’ve always been a firm believer in the basics, keeping things simple. I also subscribe to the 80/20 rule, and I don’t think Data Engineering is any different in that respect. Learning to do a few things well will, in the long run, probably solve most of the major problems encountered in data teams and architectures. Today I want to give you 8 Data Engineering best practices to hopefully give you some food for thought at least.

Read more
Photo by Dan Lohmar on Unsplash

In the beginning, I always thought the humdrum Big O Notation discussions should be reserved for Software Engineers who enjoyed working on such things. I mean, what could it possibly have to do with Data Engineering? If you are the person writing the Spark application, by all means, have at it, but if you are the Data Engineer who is simply using Spark, why can’t you just leave the details to the Devil? Seems to make sense.

The only problem with that logic is that the longer you work as a Data Engineer, the harder the problems you work on probably become; you write more and more code, and you basically end up being a specialized Software Engineer … even if you don’t want to be. In the end, to be a good Data Engineer you should at least attempt to understand the concepts behind Big O Notation and how those concepts apply to you as a Data Engineer, especially for the ETL that most of us write.
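A concrete, if made-up, example of where this shows up in everyday ETL: filtering one dataset against another. Checking membership against a Python list is linear per lookup, so the whole pass is roughly O(n * m); swapping the list for a set makes each lookup effectively constant time, and the pass roughly O(n).

# Hypothetical example: keep only the rows whose id appears in an allow-list.
rows = [{"id": i, "value": i * 2} for i in range(10_000)]
allowed_ids = list(range(0, 10_000, 2))

# O(n * m): every row scans the entire list.
slow = [r for r in rows if r["id"] in allowed_ids]

# O(n): build a set once, each lookup is ~O(1).
allowed_set = set(allowed_ids)
fast = [r for r in rows if r["id"] in allowed_set]

assert slow == fast  # same result, wildly different runtime as the data grows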

Read more