
Machine Learning from the viewpoint of an average Data Engineer.

Photo by Kevin Ku on Unsplash

I’ve been thinking more about the topic of ML and MLOps lately. To me, it seems like the buzz around ML and MLOps has quieted down over the last few years, at least somewhat, in favor of other topics like Data Quality, Data Lakes, Data Contracts, and the like. I’ve been wondering why that is, and comparing it against my experience over the last few years of working in, on, and around ML pipelines and systems. I’ve seen ML done at companies with a few thousand employees, and at companies with a handful of employees. The problems and hurdles are the same across the board, and most everyone is not very good at it.

What is it about ML / MLOps that folks don’t understand?

I think there are a lot of misconceptions about ML generally in the software world, including the Data Engineering landscape, which is closely tied to the Data Science space that drives a lot of ML projects. I’m just going to start by stating some assertions that I have found to be true while watching ML and MLOps play out across teams of various sizes, assertions that tend to have surprising results.

I’m sure I will raise all sorts of ire across the board, from the Data Scientists, to the elusive and far too smarty-pants Machine Learning Engineers, and the like. The older I get, the more I like to stir the pot.

Truths about Machine Learning.

  • MLOps is far and away more work and more difficult than the R&D to find some “model” that works well.
  • Data Scientists think they write good code and design good architecture, but in real life are mostly terrible from an SWE perspective.
  • Data is always a problem, its format, its dirtiness, its volume, and its breadth.
  • MLOps tooling is always fragmented, and the one-size-fits-all tool always disappoints.
  • There is rarely ever “good” integration and teamwork between the Scientist and Engineering.
  • “Forcing” Data Scientists to write better code, unit tests, and be like SWEs will make them mad and fail.
  • Pairing a single Data Engineer and Data Scientist to work on a problem, from the beginning, together, will result in success.
  • Knowing who ran what model, with what data, and what results, and how to re-run it … is one of the biggest hurdles.
  • The easier it is to “run” a model, the better, either by command line, or some UI.
Photo by Raimond Klavins on Unsplash

MLOps is far and away more work and more difficult than the R&D to find some “model” that works well.

To me, it seems there is a prevailing misconception out there among the masses, especially among those that don’t do ML work or haven’t worked closely on some ML systems, that the black magic of producing a model … well … is the black magic. This usually isn’t the case. Most often the complexity and work go into all the little boxes that surround the actual training and production of the model in question.

There are always so many more hurdles when it comes to data, metadata, tracking all of that, automating the training, exercising the model, generating the features, and deciding what to do with the predictions and how to track and store them, etc., etc. Sure, some smart Data Scientist goes off in the corner to theorize about what features to produce, how they affect the model, and the meaning of the results. But that is usually the smallest piece of the puzzle; it often doesn’t take a DS that long to come up with some model whose performance is reasonable, and that is not what keeps ML projects out of Production deployment!

What keeps a model or project out of production is the inability to reasonably and reliably deploy and exercise models, along with the production of features, tracking, etc. That’s typically where the train runs off the tracks.
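Just as a rough illustration, here is a minimal sketch of what some of those surrounding boxes can look like in Python. Everything in it is hypothetical and simplified, made-up column names, a toy scikit-learn model, a local folder for run artifacts, and it still ignores scheduling, deployment, and serving entirely.

```python
# A minimal, hypothetical sketch of the plumbing around a model: loading,
# basic checks, feature generation, training, and recording what happened.
# Column names, file layout, and the toy model are made up for illustration.
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_pipeline(raw_csv: str, output_dir: str) -> Path:
    run_dir = Path(output_dir) / datetime.now(timezone.utc).strftime("run_%Y%m%dT%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # 1. Load the raw data and do a basic sanity check.
    raw = pd.read_csv(raw_csv)
    assert not raw.empty, "raw data set is empty"

    # 2. Feature generation (trivial here, rarely trivial in real life).
    features = raw.drop(columns=["label"])
    labels = raw["label"]

    # 3. The modeling step everyone talks about.
    X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    # 4. The part that keeps the project alive: record what ran, on what data,
    #    and with what result, so the questions that come later can be answered.
    (run_dir / "run_metadata.json").write_text(json.dumps({
        "raw_data": raw_csv,
        "feature_columns": list(features.columns),
        "accuracy": float(accuracy),
    }, indent=2))
    return run_dir
```

Even in a toy like this, the model is one line; everything else is the boxes around it.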

Data Scientists think they write good code and design good architecture, but in real life are mostly terrible from an SWE perspective.

Photo by Jason Rosewell on Unsplash

I can hear the click-clack of a thousand angry fingers writing me messages, that I will promptly delete, ringing in my ears as I typed out that heading. Well, it’s just true most of the time. Why? Because that isn’t really what Data Scientists are paid to do. Most of them probably enjoy writing code well enough, but they enjoy the other parts of Data Science more; otherwise, they would just go be SWEs.

I’ve worked with many very smart DS persons, far and away above my level, off in another universe. But that same DS can write the most terrible code I’ve seen in my life. That’s just life. Why? Because the Scientist I’ve worked with on X, Y, or Z project just isn’t really concerned with clean, functional code, unit tests, or where the features get stored every day. They are worried about their model and the next idea.

And, as a Data Engineer working on some ML pipeline … I couldn’t care less about what type of model they are choosing, why they chose it, or how excited they are to hyper-parameter tune it … I pretty much care about everything else besides that.

Data is always a problem, its format, its dirtiness, its volume, and its breadth.

Another mistake I see folks making is the assumption that data isn’t the single most important part of an ML model project, when it clearly is. It isn’t even worthwhile to work on coming up with a model if you don’t have the right kind of data to support the use case. The data has to be reasonably reliable and continuous. It has to be easily accessible, and the conversion of data into features, the metadata tracking of what data was used to produce which features, and which features were developed and why … this is really the crux of the ML lifecycle.

When some model comes close to making it into production, without fail, there will always be questions posed by outside groups, marketing, product, etc., about the results the model is producing. Why did this happen on this day? Why is this a poor prediction and that one isn’t? The list goes on. Most of the time the first answer is found back in the data, a lack thereof, a blip here and there.

Not being able to track and explain the data and features, as related to the model and its results … will sink the ship most of the time.
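As a rough idea of what that tracking and checking looks like in practice, here is a minimal sketch of the kind of data checks that catch the blips before they turn into awkward questions about predictions. The column names, rules, and daily-feed assumption are all hypothetical.

```python
# A hypothetical sketch of pre-feature data checks. The column names
# ("event_time", "amount") and the daily-feed assumption are made up.
import pandas as pd


def validate_events(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly on the data problems that otherwise surface later as 'weird predictions'."""
    if df.empty:
        raise ValueError("data quality check failed: no rows at all")

    problems = []
    if df["event_time"].isna().any():
        problems.append("missing event timestamps")
    if (df["amount"] < 0).any():
        problems.append("negative amounts")

    # Gaps in a daily feed are the classic "blip here and there".
    days = pd.to_datetime(df["event_time"].dropna()).dt.normalize()
    if not days.empty:
        expected = pd.date_range(days.min(), days.max(), freq="D")
        missing_days = expected.difference(days.unique())
        if len(missing_days) > 0:
            problems.append(f"{len(missing_days)} day(s) with no data")

    if problems:
        raise ValueError("data quality check failed: " + "; ".join(problems))
    return df
```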

MLOps tooling is always fragmented, and the one-size-fits-all tool always disappoints.

Photo by Emily Morter on Unsplash

The most annoying thing about the world of ML and MLOps is that after all this time, it’s still constantly changing and moving. Unless all the stars align and the work is super simple and the gods are shining on us … most ML pipelines vary widely enough that if a single tool is picked, other things fall off the edge. No matter what vendors or open source tools try to tell you, or sell you, it’s extremely unlikely that you can adopt a single tool and, poof, all of your problems are solved and ML is now the easiest thing ever.

What I’m trying to say is … to make an ML project successful, focus on how to solve the problems at hand, using your brain; don’t start with a tool (or tools) and try to force the peg into the hole.

There is rarely ever “good” integration and teamwork between the Scientist and Engineering.

What’s new, I guess, the age-old problem. Anytime folks start poking at grumpy old SWEs or DEs, there are bound to be a few sparks that fly. I mean, generally speaking, what a DS wants and what a DE wants are probably going to be different, and their focus will be different. There are probably going to be problems between the two groups about “how” things should be done, with each group focused on the different things that they see as most important.

The only time I’ve ever seen this work fairly well is when a DE and a DS are embedded on the same team, with the same manager over both. Why? Because it forces more collaboration and *gasp* people even becoming good friends.

What happens in most places is that some DS team does a bunch of work, then throws a pile of crap over the wall at engineering, who promptly start to freak out and re-write and re-factor and mess with tools etc. All this huff-a-luff leads to heartache and ML projects not making it to production.

“Forcing” Data Scientists to write better code, unit tests, and be like SWEs will make them mad and fail.

Photo by Artur Tumasjan on Unsplash

This is pretty much self-explanatory. People are people, and they don’t like to do things they don’t want to do. Did you like it when your mom made you eat broccoli? You might have done it, but not with heart, probably with much anger and bitterness. Probably the best way to solve the bad code problem … is to just resign yourself to reality.

Just know that winning the battle is a slow process. Sometimes it’s simply a win to just convince someone that unit tests are a thing and that functions that are 100 lines of code should probably be refactored. You don’t want to spend time on an ML project just fighting back and forth between Engineering and Science, you want to understand each other, and work with each other to start putting pipelines in place that are reliable and repeatable, which in the long run will win the heart of the Science team.
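For what it’s worth, a small, cheap test on a feature function is often the easiest first win. Here is a sketch of what that can look like; the recency feature and its column names are a made-up example.

```python
# A sketch of a small unit test on a feature function, runnable with pytest.
# The feature itself ("days since last purchase") is a hypothetical example.
import pandas as pd


def days_since_last_purchase(purchases: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    """Per-customer recency feature: days between the last purchase and `as_of`."""
    last_purchase = purchases.groupby("customer_id")["purchased_at"].max()
    return (as_of - last_purchase).dt.days


def test_days_since_last_purchase():
    purchases = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "purchased_at": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-05"]),
    })
    result = days_since_last_purchase(purchases, pd.Timestamp("2024-01-15"))
    assert result.loc[1] == 5
    assert result.loc[2] == 10
```

Nothing fancy, but it’s the kind of thing a Scientist tends to keep around once it has saved them from a bad feature.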

Pairing a single Data Engineer and Data Scientist to work on a problem, from the beginning, together, will result in success.

Friends, friends, friends, that is the way to get things done and be happy in life, and work. The key to success on an ML project is to put Engineer(s) and Scientist(s) together from the very start, letting each focus on what they are good at, collaborating, and building on a solid foundation. The problem with working separately is that it’s hard to bring the wagons back together without some serious heartache.

Two people with different skills can accomplish a lot more together as a team than when they are separate, pushing and pulling against each other, fighting for their piece of turf or their way of doing things. Engineers will learn things from the Scientists they work closely with, and the opposite is true. A good Scientist can be supercharged when paired with a good Engineer who can help with the architecture, tooling, and ops problems.

Knowing who ran what model, with what data, and what results, and how to re-run it … is one of the biggest hurdles.

The metadata is one of the biggest hurdles to good ML pipelines and Ops. It’s pretty much the ability to know what’s going on inside a complex pipeline: transformations, features, predictions, and the like. Sure, dashboards are nice, but that’s not really what I’m talking about. I’m talking about the ability to track the minute changes that take place when training a model, or producing features from a data set, for example.

Much of Machine Learning is about experimentation, trying things out, and running new data sets with new features. It doesn’t matter if it’s Marketing asking the question, or a Product Manager, or a Data Scientist new to the team or project, you have to be able to answer very detailed questions very quickly. What were the results of the last training run? What model version is running, and what data set was used to get those results? The predictions came out last Tuesday, so what was the input data set?

These are very specific questions, and they are very important; the trust in, and use of, a model system relies on being able to answer them. Think about all the Ops going on in the background to capture that information and provide it in an easily accessible way.
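As one rough example of capturing that information, here is a minimal sketch using MLflow’s tracking API (one common choice, not the only one). The run name, parameters, and synthetic data set are made up for illustration; by default this just writes to a local mlruns folder.

```python
# A minimal sketch of run tracking with MLflow: log what went in (data,
# parameters) and what came out (metrics) for every training run.
# The experiment itself uses a synthetic, made-up data set.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="example_training_run"):
    # What went in: the data and the knobs that were turned.
    mlflow.log_param("training_data", "synthetic make_classification sample")
    mlflow.log_param("learning_rate", 0.05)

    model = GradientBoostingClassifier(learning_rate=0.05, random_state=0).fit(X_train, y_train)

    # What came out, so "what were the results of the last run?" has an answer.
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("validation_auc", float(auc))
```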

The easier it is to “run” a model, the better, either by command line, or some UI.

Photo by Matthias Speicher on Unsplash

This might not seem obvious to everyone, but the simple task of being able to run a model training, or kick off a prediction on a data set … that seemingly simple task … can be overwhelming in many poorly designed and run ML projects. In the worst case, which I have seen with my own two eyes, it can be a long series of tasks requiring various and complex command line arguments to be run, instances spun up, code modified, and many other sundry gyrations, all to train a dumb model. At that point, no one knows how to repeat the process, find the results, or relate the results and data together in any meaningful way.

The ability to kick off a training run, or some prediction, should be no more than one or two simple command line invocations or GUI button clicks, maybe along with updating some configuration at most. Otherwise, the project is probably doomed from the start; no one will want to touch it, and innovation will be suffocated under the burden of frustration and tears.
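As a rough sketch of what “one or two simple commands” can mean, here is a thin argparse wrapper. The file name, subcommands, and config layout are all hypothetical; the point is only that the entry point is boring and obvious.

```python
# A hypothetical, thin command-line entry point so training or predicting is
# one command instead of a pile of manual steps and tribal knowledge.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the ML pipeline")
    subcommands = parser.add_subparsers(dest="command", required=True)

    train = subcommands.add_parser("train", help="train a new model version")
    train.add_argument("--config", default="configs/train.yaml")

    predict = subcommands.add_parser("predict", help="score a data set with an existing model")
    predict.add_argument("--model-version", required=True)
    predict.add_argument("--input-path", required=True)

    args = parser.parse_args()
    if args.command == "train":
        print(f"would train using {args.config}")  # call the real training pipeline here
    else:
        print(f"would score {args.input_path} with model {args.model_version}")


if __name__ == "__main__":
    main()
```

With something like that in place, training is `python pipeline.py train --config configs/train.yaml` and nothing more.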

Musings on Machine Learning as a Data Engineer.

I’ve seen my fair share of ML pipelines, from simple to complex, some with APIs and online predictions running smoothly. Others have destroyed dozens of Data Scientists, burying them all in sadness and anger, turning their gleaming dreams into nightmares. What did it come down to? It had nothing to do with the “sophistication” of the model, the parameters being tuned to the utmost, or the perfect data.

It all had to do with the infrastructure that makes ML possible: MLOps. It has to do with clean code, tests, metadata tracking, a good command line or UI, confidence in the process, the ability of anyone to train the model in a new and different way, and the ability to answer questions about predictions, the features and data that went in, and why something was acting the way it was.