Home - Confessions of a Data Guy

Big Data, Data, Data Engineering, Python, Scala

Converting CSVs to Parquets… with Python and Scala.

With Parquet taking over the big data world, as it should, and CSV files being that third wheel that just will never go away… it’s becoming more and more common to see the repetitive task of converting CSV files into Parquet files. There are lots of reasons to do this: compression, fast reads, integrations with tools […]

January 2, 2021

Big Data, Data, Data Engineering, Ramblings

Top 10 Data Engineering Blogs

I’ve always been surprised, given the rise of data engineering and big data, at how hard it is to find good data engineering content that is published somewhat regularly. Tech moves fast and I feel like data engineering moves even faster. New tools and systems come out with regular frequency, and it’s hard to keep […]

December 27, 2020

Data, Data Engineering, Python, Ramblings

Musings on Python’s map() and filter()

I’ve always been surprised at how little of the Python code I’ve seen uses the built-in map() and filter() functions. I’ve always found them useful and easy to use, but I don’t often come across them in the wild. I’ve even been asked to remove them from my MRs/PRs, for no other […]

December 16, 2020

Big Data, Data, Data Engineering, Data Warehousing, Python

4 DataWarehouse-ish Functions For Your PySpark Dataframes

Ever felt like just exploring documentation… seeing what you can find? That’s what you do on a cold, first-snowstorm-of-the-year Sunday afternoon, after the initial fun has worn off, the kids don’t want to go outside anymore, and Netflix has nothing new to offer up. So I thought I might as well […]

December 14, 2020

Big Data, Data, Data Engineering, Data Warehousing, Python

Intro to Apache Cassandra for Data Engineers

Hmm… yet another distributed database… will it ever end? Probably not. It’s hard to keep up with them all, even the old ones. That brings me to Apache Cassandra. Of all the popular big data distributed databases, Cassandra seems to be kind of that student who always sits in the back row and never […]

December 10, 2020

Data, Data Engineering, Data Warehousing, SQL

Database/SQL Fundamentals for Data Engineers

I’ve met my fair share of snooty people who pooh-pooh SQL and databases as second-class hand-me-downs. I still remember talking to an academic computer science grad who was explaining to me how he refused to teach database classes; he was just too good for that. Whatever. Apparently refusing to accept how 90% of […]

December 4, 2020

Data, Data Engineering, Ramblings, Scala

Scala with Text Files and ElasticSearch

I seriously don’t know why I keep doing this to myself. I know learning new things is something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straightforward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in […]

November 20, 2020

Big Data, Data, Data Engineering, Python, Uncategorized

Intro to Apache Beam for Data Engineers

What is this thing? What’s it good for? Who’s using it and why? That’s pretty much what I ask myself once a month when I actually see the name Apache Beam pop up in some feed I’m scrolling through. I figured it has to be legit to be Apache incubated, but I’ve never run across […]

November 17, 2020

Data, Data Engineering, Python, SQL

Pandas DataFrame.to_sql() …. { how you should configure it to not be that guy. }

I never understand it when someone comes up with a great tool, then defaults it to work poorly… leaving the rest up to imagination. The Pandas DataFrame has a great and underutilized tool… to_sql(). Lesson learned, always read the fine print I guess. I’m usually guilty of this myself… wondering why something is slow […]

November 11, 2020

Data Engineering, Ramblings, Uncategorized

The Battlefield of the Data Engineer.

I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean, the URL does say “confessions”, does it not? For better or worse I’ve been thinking a lot lately about what it means to be a Data Engineer, what it’s like to be a Data Engineer, and what makes a […]

November 2, 2020
