The two coolest kids in class … I mean seriously … every other post in Data Engineering world these days is about Apache Airflow or DataBricks. It’s hard to kick against the goad. Just jump on the band wagon before you get left in the dust. I’ve used both DataBricks and Apache Airflow, they both are pretty important and integral tools for data engineers these days. Apache Airflow makes overall complex pipeline dependencies, orchestration, and management intuitive and easy. DataBricks has delivered with AWS and EMR could not, easy to use Spark and DeltaLake functionality without the management and config nightmares of running Spark yourself.

Recently I worked on an Airflow and DataBricks/DeltaLake integration, time to talk what it looks like and options when doing this type integration.

Read more

When I used to think of lambda functions on AWS my eyes would glaze over, I would roll my eyes and say, “I work with big data, what in the world can a silly little AWS lambda function offer me?” I’ve had to eat my own words, those little suckers come in handy in my day to day engineering work. I want to talk about how every data engineer working with AWS can take advantage of lambda’s and add them to their data pipeline tool belt.

Read more

Dagster, the first few times I read the name, I just couldn’t take the tech stack seriously …. it’s still kinda hard. Today I want to compare Airflow vs Dagster, mostly explore what Dagster is and does. But I want to compare it to the popular Apache Airflow project so people have some context for it. It’s kinda hard keeping up on all the new stuff these days, I usually just wait till I see enough articles and tweet floating around about it, then I know it’s maybe worth a peak. Let’s crack open Dagster, and see if it’s better then the name chosen for it.

Read more

Not going to lie. I’ve been trying to figure out for awhile where Apache Flink fits in the Data Engineering world for awhile now. A year or two ago I didn’t seem much content posted about it, but it seems to be picking up stream. I’ve mostly managed to avoid understanding what Flink is or does, but I figured it’s time to give my brain a much needed workout. When I was ignoring Flink, I just chalked it up as another messaging/streaming system like Kafka or Pulsar. Apparently I was wrong … no surprise there.

Read more

There are few things in life that are worse then cracking open some serious PySpark pipeline code, and then realizing there isn’t a single function written to encapsulate logic … wondering if some change you are about to make will bring down the whole pipeline. When you are new to a codebase you don’t know what you don’t know, you don’t have any backstory and you are usually flying by the seat of your pants in the beginning. When you have no unit tests, usually the only other way to test changes on a Spark pipeline are to run it …. which is sometimes easier said then done in a development environment. The first line of defence should be unit testing the entire PySpark pipeline.

Read More

If you’re anything like me when someone says Delta Lake you think DataBricks. But, the mythical Delta Lake is an open source project, available to anyone running Apache Spark. It seems also too good to be true, ACID transactions on the Spark scale? Incredible. This is the future, it has to be. The lines of what is a data warehouse have been starting to blur for a long time, I have a feeling Delta Lake will be the death blow to the traditional DW … or its rebirth??

Read more

In part 1 of the big data file formats we reviewed Parquet vs Avro. It was apparent from the start that the two file formats were built for different things. Avro is clearly a complex row structured file format used in communication and transactions, where schema is king and nested structures are no problem. Parquet on the other hand has risen to the top with the popularity of Spark, is columnar based storage and is well suited to structured and tabular type data. But, lest the annals of inter-webs call us uncouth and forgetful, we must add ORC file format to the list.

Read more

Don’t you like stuff for free? Don’t you like it when stuff I just handed to you? I mean when is that last time you didn’t want to get a free t-shirt. How about 20 bucks in the mail from you Grandma? That’s kinda what Pipelines are in Spark ML. The Apache Spark ML library is probably one of the easiest ways to get started into Machine Learning. Leaving all the fancy stuff to the Data Scientist is fine, Data Engineers are more interested in the end-to-end. The Pipeline, and the Spark ML API’s provide a straight froward path to building ML Pipelines that lower the bar for entry into ML. So, set right up, come get your free ML Pipeline.

Read more

With parquet taking over the big data world, as it should, and csv files being that third wheel that just will never go away…. it’s becoming more and more common to see the repetitive task of converting csv files into parquets. There are lots of reasons to do this, compression, fast reads, integrations with tools like Spark, better schema handling, and the list goes on. Today I want to see how many ways I can figure out how to simple convert an existing csv file to a parquet with Python, Scala, and whatever else there is out there. And how simple and fast each option is. Let’s do it!

Read more

I’ve always been surprised at the distinct lack of most Python code I’ve seen using the map() and filter() methods as standalone functions. I’ve always found them useful and easy to use, but I don’t often come across them in the wild, I’ve even been asked to remove them from my MR/PR’s, for no other reason then that they are supposedly ambiguous to some people? That’s got me thinking a lot about map() and filter() as related to readability, functional programming, side effects and other never ending debates where no one can even agree on the “correct” definition. Seriously. But, I will leave that rant for another time.

Read more