Home - Confessions of a Data Guy

Performance Testing Postgres Inserts with Python

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I […]

December 17, 2021

Big Data, Data, Data Engineering, Data Warehousing

Hive Metastore in Databricks – What To Know.

Hive is like the zombie apocalypse of the Big Data world: it can’t be killed, it keeps coming back. More specifically, the lesser-known Hive Metastore is the little sneaker that has wormed its way into a lot of Big Data tooling and platforms, in a quasi-behind-the-scenes way. Many people don’t realize it, […]

December 5, 2021

Big Data, Data, Data Engineering, Data Warehousing

Lessons Learned from MERGE operations with Billions of Records on Databricks Spark

Something happens when you start working with tens of billions of records and data sets that are hundreds of TBs in size. Do you know what happens? Things stop working, that’s what. I miss the days when 1-10 TBs were considered large and in charge. The good ole days. I want to talk about lessons […]

December 1, 2021

Big Data, Data, Data Engineering, Data Warehousing, SQL

CTE vs SubQuery

What to choose, what to choose? The age-old problem that has plagued data engineers forever (ok, maybe like 10 years): should you use CTEs or subqueries when writing your SQL code? This has become an even more relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. […]

November 18, 2021

Data, Data Engineering, Python, SQL

ORM’s are the Cigarettes of the Data Engineering World.

Seriously, just don’t do it, they are bad for you. Listen to your mother, just say no. The dreaded ORMs (Object Relational Mappers) do all the hard SQL work for you, but they come with many unintended consequences that are bad for your health and wellness in the long term. Many unsuspecting […]

November 15, 2021

Big Data, Data, Data Engineering, Python

3 Tips for Unit Testing PySpark Pipelines

I’m not sure what it is, but some prevailing evil in the Data Engineering world has made it not so common for PySpark pipelines to be unit tested. Who knows, it’s probably a combination of things. Data Engineers have been accused of not having good Software Engineering principles. Functional testing is a hot commodity in […]

November 9, 2021

Big Data, Data, Data Engineering, Data Warehousing

6 Tips for Optimizing Databricks Cluster and Pipeline Performance and Cost

Databricks is easily the hottest tool these days for Data Lakes and Data Warehousing. It’s a beast. As with any new technology, there are always growing pains, learnings, and tips and tricks that might not be obvious to those dipping their toes in the water. Not understanding certain concepts, and being unaware of specific configurations can […]

October 29, 2021

Big Data, Data, Data Engineering, Data Warehousing

Data Modeling – Relational Databases (SQL) vs Data Lake (File Based)

Data Modeling is a topic that never goes away. Sometimes I do reminisce about the good ol’ days of Kimball-style data models, it was so simple, straightforward, just the same thing for years. Then Big Data happened, Spark happened. Things just changed. There is a lot of new content coming out around Data Lakes and […]

October 20, 2021

Big Data, Data, Data Engineering, Python, Ramblings

Review of Airbyte for Data Engineers

It’s hard to keep up with the never-ending stream of new Data Engineering tools these days. Always something new around the next bend. I find it interesting to kick the tires on the new kids on the block. It’s always interesting to see what angle or pain point a new tool tries to hone in […]

October 13, 2021

Big Data, Data, Data Engineering, Python

Bitwise Operations for Data Engineers

Ugh. Cursed bitwise operations … something usually reserved for the low-level mythical engineers writing code no one should have to write. I’ve escaped them all but twice during my meager existence. Recently I had to use a bitwise operation while converting a Python hashing algorithm into PySpark code. It made my brain hurt. What is this […]

October 13, 2021
