In Part 1 of my laborious journey from Python to Scala, I did some work with file operations, CSV files, and messing with the data. It took me a little longer than I expected to wrap my head around the Scala functional/object/immutable approach to software design. But in the end it felt satisfying, and I’m starting to become a convert. Scala makes you think a little harder than Python, is less forgiving, and requires more of you as the developer. In part deux, I figured the next topic to grapple with would be some simple retrieval of remote files and writing those files to disk. Also, I wanted to take a crack at Classes in Scala.

Read more

One of the greatest tools in Python is Pandas. It can read just about any file format, gives you a nice data frame to play with, and provides many wonderful SQL-like features for playing with data. The only problem is that Pandas is a terrible memory hog, especially when it comes to concatenating groups of data/data frames together (stacking/combining data). Just google “pandas concat memory issues” and you will see what I mean. Basically what it comes down to is that Pandas quickly becomes useless for anyone wanting to work on “big data” unless you want to spend $$$ on some machines in the cloud with tons-o-ram.

Read more

UPDATE: If you want to know how my Scala SHOULD have been written, check out this link!

I feel like a frontiersman heading west, into the unknown. I’ve been successful using Python as a Data Engineer for some time, processing terabytes of data with what “real” programmers sneer at as barely even a real language. Whatever. But some of my favorite tools, like Spark, are written in Scala, and it’s on the rise, so I should probably join the lemmings in their mad dash. If for no other reason than to expand my horizons.

Read more
Apache Parquet vs Apache Avro

There comes a point in the life of every data person when we have to graduate from CSV files. At a certain point the data becomes big enough, or we hear talk on the street about other file formats. Apache Parquet and Apache Avro are two of those formats that have been coming up more with the rise of distributed data processing engines like Spark.

Read more
Complexity is in the eye of the beholder.

Building Machine Learning (ML) pipelines with big data is hard enough, and it doesn’t take much of a curve ball to make it a nightmare. Most of what you will read online are tutorials on how to take a few CSV files and run them through some sklearn package. If you are lucky, you might find some “big data” ML stories on Medium where someone uses Spark to crunch a bunch of JSON, Parquet, or CSV files at a scale of 10 to a few hundred gigabytes of data. Usually they are simplistic and ambiguous. Unfortunately, that isn’t how it works in the real world.

Read more

On again, off again. I feel like that is the best way to describe Apache Airflow. It started out around 2014 at Airbnb and has been steadily gaining traction and usage ever since, albeit slowly. I still believe that Airflow is very underutilized in the data engineering community as a whole. Most everyone has heard of it, but its usage seems to be sporadic at best. I’m going to talk about what makes Apache Airflow the perfect tool for any Data Engineer, and show you how you can use it to great effect while not committing to it completely.

Read more
Exploring Elasticsearch with Python.

What’s Elasticsearch, precious? I feel like Gollum when confronted by taters. Elasticsearch has been around for a while now. Based on Lucene, it has become a well-known name in the field of text and semi-structured data storage, analysis, and retrieval. Even though it’s popular enough to get name recognition, I’ve rarely run across it in the wild. We are going to dip our toes into Elasticsearch by working on a small project to store and search a book(s). It gives us just enough simple problems to solve that by the end we should have at least a basic understanding of how to connect, store, and retrieve simple documents with Elasticsearch.

Read more

Ah. What a classic. The one piece of code that I end up writing over and over again; you would think I would have stashed it away by now. Not going to lie, I usually have to Google it, while thinking, is this the right way? Should I just open the CSV file and iterate over it? Should I import the csv module? Should I just use Pandas? Does it matter? Probably not.
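For a quick feel of the first two options, here is a minimal sketch using only the standard library. The sample data and the function names are made up for illustration, and the in-memory string stands in for a real file handle. It shows the one case where the choice does matter: a quoted field with a comma inside it.

```python
import csv
import io

# Hypothetical sample data; the quoted name contains a comma on purpose.
SAMPLE = 'id,name,city\n1,"Smith, Jane",Des Moines\n2,Bob,Omaha\n'


def read_by_splitting(handle):
    """Open-and-iterate approach: split every line on commas."""
    return [line.rstrip("\n").split(",") for line in handle]


def read_with_csv_module(handle):
    """csv module approach: DictReader handles quoting and the header row."""
    return list(csv.DictReader(handle))


# io.StringIO stands in for open("some_file.csv", newline="").
naive_rows = read_by_splitting(io.StringIO(SAMPLE))
csv_rows = read_with_csv_module(io.StringIO(SAMPLE))

print(naive_rows[1])  # the quoted name is split into two pieces
print(csv_rows[0])    # the quoted name stays in one field
```

If the file is guaranteed to have no quoted commas, plain splitting is fine for a throwaway script. Otherwise the csv module is probably the safer default.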

Read more
A fight to the death. A comparison of geo-spatial tools in Python. What’s easy and fast to use?

It’s a fight to the death, people… that’s why it’s called Thunderdome. This will be no different. Last time we talked about the very basics of the strange world of geo-spatial tools for data engineering. The next most obvious thing to do, of course, is to see what tool is the best. By best I mean what tools can be used to load and do simple manipulation of data in a fast and relatively simple manner.

Read more
Quick view of geospatial data landscape.

What does a data engineer need to know about working with geospatial data? I’m going to give my two cents on what is and is not important. First, prepare to be annoyed, as you will most likely spend hours debugging strange and non-obvious errors and bugs. You should run screaming the other way, but in case that is not an option, here are the basics.

Read more