With Parquet taking over the big data world, as it should, and CSV files being that third wheel that just will never go away… it’s becoming more and more common to see the repetitive task of converting CSV files into Parquet. There are lots of reasons to do this: compression, fast reads, integrations with tools […]

With the rise of data engineering and big data, I’ve always been surprised how hard it is to find good data engineering content that is published somewhat regularly. Tech moves fast and I feel like data engineering moves even faster. New tools and systems come out all the time; it’s hard to keep […]

I’ve always been surprised at how little of the Python code I’ve seen uses map() and filter() as standalone functions. I’ve always found them useful and easy to use, but I don’t often come across them in the wild. I’ve even been asked to remove them from my MRs/PRs, for no other […]

Ever felt like just exploring documentation… seeing what you can find? That’s what you do on a cold, first-snowstorm-of-the-year Sunday afternoon, after the initial fun has worn off, the kids don’t want to go outside anymore, and Netflix has nothing new to offer up. So I thought I might as well […]

Hmm… yet another distributed database… will it ever end? Probably not. It’s hard to keep up with them all, even the old ones. That brings me to Apache Cassandra. Of all the popular big data distributed databases, Cassandra seems to be kind of that student who always sits in the back row and never […]

I’ve met my fair share of snooty people who pooh-pooh SQL and databases as second-class hand-me-downs. I still remember talking to an academic computer science grad who was explaining to me how he refused to teach database classes; he was just too good for that. Whatever. Apparently refusing to accept how 90% of […]

I seriously don’t know why I keep doing this to myself. I know learning new things is something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straightforward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in […]

What is this thing? What’s it good for? Who’s using it and why? That’s pretty much what I ask myself once a month when I actually see the name Apache Beam pop up in some feed I’m scrolling through. I figured it has to be legit to be Apache incubated, but I’ve never run across […]

I never understand it when someone comes up with a great tool, then defaults it to work poorly… leaving the rest up to imagination. The Pandas dataframe has a great and underutilized tool… to_sql(). Lesson learned: always read the fine print, I guess. I’m usually guilty of this myself… wondering why something is slow […]

I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean, the URL does say “confessions”, does it not? For better or worse I’ve been thinking a lot lately about what it means to be a Data Engineer, what it’s like to be a Data Engineer, and what makes a […]