Home - Confessions of a Data Guy

How Chuck Norris Proved Async in Python isn’t Worthy.

There are some things I will never understand. Async in Python is one of them. Yes, sometimes I use it, but mostly because I’m bored and we all should have some kind of penance. Async is mine. It’s slow, it’s confusing, and other people get mad at you when they have to debug your Async code. I’ve […]

July 4, 2020

Data, Data Engineering, Scala

My Journey from Python to Scala – Part Deux

In Part 1 of my laborious journey from Python to Scala, I did some work with file operations, CSV files, and messing with the data. It took me a little longer than I expected to wrap my head around the Scala functional/object/immutable approach to software design. But in the end it felt satisfying and I’m […]

June 22, 2020

Data, Data Engineering, Machine Learning, Python

Solving the Memory Hungry Pandas Concat Problem.

One of the greatest tools in Python is Pandas. It can read just about any file format, gives you a nice data frame to play with, and provides many wonderful SQL-like features for playing with data. The only problem is that Pandas is a terrible memory hog, especially when it comes to concatenating groups of […]

June 8, 2020

Data, Data Engineering, Python, Scala

My Journey from Python to Scala – Part 1

UPDATE: If you want to know how my Scala SHOULD have been written, check out this link! I feel like a frontiersman heading west, into the unknown. I’ve been successful using Python as a Data Engineer for some time, processing terabytes of data with what “real” programmers sneer at as barely even a real language. […]

May 6, 2020

Python

The Utter Failure of Async in Python

I’m probably going to have to eat this blog post 2 years from now… oh well. I still believe that Async has been mostly a failure since it was introduced in Python 3.4. Maybe I should be more specific: there seems to be a failure to adopt Async in the Python community and major packages at large. […]

April 16, 2020

Data, Data Engineering, Python, Uncategorized

Big Data File Showdown – Avro vs Parquet with Python.

There comes a point in the life of every data person when we have to graduate from CSV files. At a certain point the data becomes big enough or we hear talk on the street about other file formats. Apache Parquet and Apache Avro are two of those formats that have been coming up more with […]

April 5, 2020

Data, Data Engineering, Machine Learning, Python

Challenges of Machine Learning Pipelines at Scale… When You Don’t Work at Google.

Building Machine Learning (ML) pipelines with big data is hard enough, and it doesn’t take much of a curve ball to make it a nightmare. Most of what you will read online are tutorials on how to take a few CSV files and run them through some sklearn package. If you are lucky, […]

March 14, 2020

Data, Data Engineering, Python, Uncategorized

Apache Airflow for Data Engineers

On again, off again. I feel like that is the best way to describe Apache Airflow. It started out around 2014 at Airbnb and has been steadily gaining traction and usage ever since, albeit slowly. I still believe that Airflow is very underutilized in the data engineering community as a whole; most everyone has heard […]

January 11, 2020

Data Engineering, Python, SQL

Introduction to Postgres with Python

If there was ever a match made in heaven, it’s using Python and Postgres together. They were made for each other. Both are fun, easy to use, and addicting, and both have so many surprises and hidden gems. Like Gandalf and Frodo, the two just go together. Today I want to go through the basics of […]

December 30, 2019

Data, Data Engineering, Python

Exploring ElasticSearch with Python

What’s Elasticsearch, precious? I feel like Gollum when confronted by taters. Elasticsearch has been around for a while now. Based on Lucene, it’s become a well-known name in the storage, analysis, and retrieval of text and semi-structured data. Even though it’s popular enough to get name recognition, I’ve rarely run across it […]

December 17, 2019
