November 2020 - Confessions of a Data Guy

Scala with Text Files and ElasticSearch

November 20, 2020

I seriously don’t know why I keep doing this to myself. I know learning new things I something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straight forward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in Python, so all the snotty people who complain about Python not being fast enough or whatever can go hangout with this cow, looks like he could use a friend. This is something I’ve been meaning to do for awhile. Use Scala to read some text file(s), and store the data somewhere with some client. I chose ElasticSearch. I really just wanted practice doing something simple like reading files and I was curious about how good the Scala clients are for popular tools.

Intro to Apache Beam for Data Engineers

November 17, 2020

What is this thing? What’s it good for? Who’s using it and why? That’s pretty much what I ask myself once a month when I actually see the name Apache Beam pop up in some feed I’m scrolling through. I figured it has to be legit to be Apache incubated, but I’ve never run across anyone in the wild using it yet. On the surface it appears to be semi-pointless since it runs on-top of other distributed systems like Spark, but I’m sure there is more to it. Today, I’m going to run through an overview of Apache Beam and then try installing and running some data through it, kick the tires as it were. And see if my mind changes about the pointless bit.

Pandas DataFrame.to_sql() …. { how you should configure it to not be that guy. }

November 11, 2020

Sometimes Pandas is slow like this…. until you tweak it.

I never understand it when someone comes up with a great tool, then defaults it to work poorly… leaving the rest up to imagination. The Pandas dataframe has a great and underutilized tool… to_sql() . Lesson learned, always read the fine print I guess. I’m usually guilty of this myself… wondering why something in slow and sucks… and not taking time to read the documentation. Here are some musings on using the to_sql() in Pandas and how you should configure to not pull your hair out.

The Battlefield of the Data Engineer.

November 2, 2020

I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean the url does say “confessions” does it not? For better or worse I’ve been thinking a lot lately about what it means to be a Data Engineer, what’s like to be a Data Engineer, and what makes a good Data Engineer. Just the life of a Data Engineer in general. The Battlefield of the Data Engineer is fought in the labyrinth of nested SQL queries. It rages to the depths of distributed computing clusters. It vies for victory on the crags and peaks of DevOps. It attacks for precious ground amid the chaos of the perfect OOP and Functional code bases. Phew… and all that just to keep your head above water.

Scala with Text Files and ElasticSearch

Intro to Apache Beam for Data Engineers

Pandas DataFrame.to_sql() …. { how you should configure it to not be that guy. }

The Battlefield of the Data Engineer.

Interesting links

Pages

Categories

Archive