Ok. Get off your high horse. You are human just like the rest of us. Just like your ancient ancestors who were throwing rocks and sticks at each other a thousand years ago … you are looking for a leg up on the competition. Isn’t that the world we live in? At the end of the day, no one is looking out for you, besides you.

Do you know what all those famous “influencers” have over you? Those you look at with boiling bitterness buried somewhere inside you. Besides a smattering of luck here and there, there is one other trait they have that most people don’t.

There are branches and streams of action that come off that main trunk, but without that central drive, all else is useless.

What is it that the 1% of the data population apparently has that the other 99% don’t? Simply put, in the words of your grandparents: hard work.

Read more

Photo by Jason Leung on Unsplash

Data engineering is a vital field within the realm of data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that data.

Read more

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My, my, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, doesn’t it?

Would anyone like a nice big slice of groupBy? Maybe agg is what you need? No? Can you say distributed dataset? Whatever it is you’re looking for, I’m quite sure a nice old DataFrame can give it to you. With so many options to choose from … what do you choose? I don’t know, whatever works best for you. But it does set the stage nicely for a clash of the titans, per se.

Let’s do just that. A straight out-of-the-box performance test. A bunch of CSVs, a little aggregation, just some simple stuff. Mirror, mirror on the wall, who is the fastest with DataFrames of them all?
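To make that concrete, here is roughly the shape of the test I have in mind, sketched in Python. The folder of CSVs and the customer_id and amount columns are made-up placeholders, not the actual dataset from the test; the point is just the comparison: the same groupBy and agg, once in Pandas and once in PySpark.

import glob
import time

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

csv_glob = "data/*.csv"  # hypothetical folder of CSV files

# Pandas: read every CSV into one in-memory DataFrame, then aggregate.
start = time.time()
pdf = pd.concat(pd.read_csv(f) for f in glob.glob(csv_glob))
pandas_result = pdf.groupby("customer_id").agg(total=("amount", "sum"))
print(f"Pandas took {time.time() - start:.2f} seconds")

# PySpark: the same logic, but distributed and lazily evaluated.
spark = SparkSession.builder.appName("csv-agg-test").getOrCreate()
start = time.time()
sdf = spark.read.csv(csv_glob, header=True, inferSchema=True)
spark_result = sdf.groupBy("customer_id").agg(F.sum("amount").alias("total"))
spark_result.write.mode("overwrite").parquet("out/agg")  # force Spark to actually do the work
print(f"PySpark took {time.time() - start:.2f} seconds")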

Read more
Photo by Ray Hennessy on Unsplash

I’ve often wondered what purgatory would be like, doing penance for millennia into eternity. It would probably be doing data migrations. I suppose they are not all that dissimilar from normal software migrations, but there are a few things that make data migrations a little more horrible and soul-sucking. Data migrations can slow teams down to a crawl, take at least twice as long as planned, and be way more difficult than imagined.

Can’t it be made easy? Shouldn’t Data Migrations have been conquered by now? I mean, just put together the perfect plan, break up the work, make a bunch of tickets, estimate it all, and the rest falls into place? If only.

Read more
Photo by Francesco Alberti on Unsplash

Nothing captures the imagination and heart like a tale of betrayal and heartbreak, and that is a tale I want to bring to you today. It’s a tale of Databricks Workflows and Jobs, version changes, new features, APIs, and insidious little hidden gems that will make you pull your hair out when you find them. It’s a tale of what not to do, a tale of how to put developer and customer experience first, instead of forcing unwanted solutions down the throats of the little birdies feeding at your nest.

As a Data Engineer, simplicity and ease of use are close to my heart, something that Databricks did well, or maybe I should say used to do well … before recent releases like the Jobs 2.1 API. I hope you can hear the bitterness oozing from my words.
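For a bit of context, here is a minimal sketch of what talking to the Jobs 2.1 API looks like: creating a simple one-task job with Python and requests. The workspace URL, token, cluster id, job name, and notebook path below are all placeholders, not values from any real workspace.

import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder PAT

# Jobs 2.1 wraps everything in a "tasks" array, even a single-task job.
payload = {
    "name": "nightly-etl",  # hypothetical job name
    "tasks": [
        {
            "task_key": "main",
            "existing_cluster_id": "<cluster-id>",  # placeholder cluster
            "notebook_task": {"notebook_path": "/Repos/etl/nightly"},  # hypothetical path
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id on success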

Read more

Ok, so I don’t really mean all that. Or do I? I have no idea what the future holds. Sometimes it’s easy to pick out the winners, like Databricks and Snowflake; you can see, feel, and taste the results of those data products, a delicious and delectable bounty to feast upon. Other things are harder to read the tea leaves on. Kinda like Data Mesh … is it a thing, or is it not a thing? It’s hard to discern between real life and the charlatans and marketing/sales departments hawking the next Cure-All Snake Oil.

What about all this recent hubbub and buzz around Data Contracts, pushed by some popular Data Engineering faces like Ananth Packkildurai and Chad Sanderson? What is all the hype about Data Contracts? Are folks just pushing another tool down our throats? Is there a real issue and problem that can be solved with Data Contracts?
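If the term feels fuzzy, here is one hypothetical sketch of the core idea: the producer and the consumer agree on a schema up front, and records that break the agreement get caught at the boundary instead of silently landing downstream. I’m using pydantic purely as an illustration, and the feed and field names are made up.

from datetime import datetime

from pydantic import BaseModel, ValidationError

# A "contract" for a hypothetical orders feed: field names and types are made up.
class OrderEvent(BaseModel):
    order_id: str
    customer_id: str
    amount: float
    created_at: datetime

raw_record = {
    "order_id": "123",
    "customer_id": "abc",
    "amount": "19.99",  # string on the wire, coerced to float by the contract
    "created_at": "2023-01-01T00:00:00",
}

try:
    event = OrderEvent(**raw_record)  # validates the record against the agreed schema
except ValidationError as err:
    # In a real pipeline this is where you would quarantine the record and alert the producer.
    print(err)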

Read more
Photo by Mitchel Boot on Unsplash

Sometimes I feel like I’ve been doing this too long, life gets busy, and I don’t have much to say … but here I am 5 years later. I’m still making people mad and making a fool of myself; some things never change. This will probably be short and sweet. I will cover the top 10 most popular blog posts from those 5 years, what the traffic has looked like over time, and what I’ve learned from writing blogs for so long: the good, the bad, and the ugly.

Read more
Photo by Kevin Ku on Unsplash

I’ve been thinking more about the topic of ML and MLOps lately. To me, it seems like the buzz around ML and MLOps has quieted down over the last few years, at least somewhat, in favor of other topics like Data Quality, Data Lakes, Data Contracts, and the like. I’ve been wondering why this is the case and comparing it with my experience over the last few years of working in, on, and around ML pipelines and systems. I’ve seen ML done at companies with a few thousand employees and at companies with a handful of employees. The problems and hurdles are the same across the board, and most everyone is not very good at it.

Read more
Photo by Alain Pham on Unsplash

Best practices are always a touchy subject; I’m going to forget someone’s pet best practice, I can already feel it. I’ve always been a firm believer in the basics, keeping things simple. I also subscribe to the 80/20 rule, and I don’t think Data Engineering is any different in that respect. Learning to do a few things well will, in the long run, probably solve most of the major problems encountered in data teams and architectures. Today I want to give you 8 Data Engineering best practices to hopefully give you some food for thought at least.

Read more
Photo by Dan Lohmar on Unsplash

In the beginning, I always thought the humdrum Big O Notation discussions should be reserved for Software Engineers who enjoyed working on such things. I mean, what could it possibly have to do with Data Engineering? If you are the person writing the Spark application, by all means, have at it, but if you are the Data Engineer who is simply using Spark, why can’t you just leave the details to the Devil? Seems to make sense.

The only problem with that logic is that the longer you work as a Data Engineer, the harder the problems you work on probably become; you write more and more code, and you basically end up being a specialized Software Engineer … even if you don’t want to be. In the end, to be a good Data Engineer you should at least attempt to understand the concepts behind Big O Notation and how those concepts apply to you as a Data Engineer, especially for the ETL that most of us write.
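A concrete, if made-up, example of where this shows up in everyday ETL: filtering one dataset against another. Checking membership against a Python list is linear per lookup, so the whole pass is roughly O(n * m); swapping the list for a set makes each lookup effectively constant time, and the pass roughly O(n).

# Hypothetical example: keep only the rows whose id appears in an allow-list.
rows = [{"id": i, "value": i * 2} for i in range(10_000)]
allowed_ids = list(range(0, 10_000, 2))

# O(n * m): every row scans the entire list.
slow = [r for r in rows if r["id"] in allowed_ids]

# O(n): build a set once, each lookup is ~O(1).
allowed_set = set(allowed_ids)
fast = [r for r in rows if r["id"] in allowed_set]

assert slow == fast  # same result, wildly different runtime as the data grows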

Read more