
A Piece of DevOps that most Data Engineers Ignore.

I am always amused by the apparently contradictory nature of working in the world of data. There are always bits and pieces that come and go, the popular and the out of style … new technology driving new approaches and practices. One of the hot topics the last decade has produced is DevOps, now a staple of almost every tech department. Like pretty much every other newish Software Engineering methodology, the data world has struggled to adopt and keep pace with DevOps best practices. This one is always a thorn in my side, making my life more difficult. The simplicity with which it can be adopted is amazing, and the unwillingness and lack of adoption is strange.

Dockerfiles for Data Engineers

The lack of a Dockerfile in any data pipeline and repo I explore tells me everything I need to know about the quality and setup of the codebase. Most folks in the data world live their life without it, thinking that containerization is for the software engineers of the world, but this is not the case. If anything, the Data Engineering and Data Science worlds have more of a use case for Dockerfiles than most.

Why data needs Dockerfiles

It’s pretty common today for most Data Engineering/Data Science/ML workloads to be Python heavy. What’s the best and worst part about PyPI and Python packages? They are incredibly finicky, break easily, cause requirement conflicts, and require a large amount of magic to not break over time.
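Pinning those packages in a requirements.txt that the Dockerfile installs later is half the battle. A minimal sketch of what that file might hold (the packages and versions here are just placeholder examples, not a recommendation):

pyspark==3.0.1
pandas==1.1.5
psycopg2-binary==2.8.6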

What else is common for data workloads and pipelines? Relational databases and the connections that go with them. Anything else? An amazing number of command line tools.

Could there be more reasons? I'm glad you asked; yes, there are. Typical complex data pipelines and codebases require environment variables, configurations, and specific directories and code layout.

This is what a Dockerfile is for. Why not make life easier for yourself and others? With a simple docker run or docker-compose up command, everything that is needed to run and test pipeline code is at your fingertips. All the setup complexity is written once and hidden away, rarely to be messed with again.
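To make that concrete, here is a minimal sketch of what running a pipeline inside a container might look like. The image name, mount path, and entry point script are all hypothetical:

docker run --rm -v "$(pwd)":/code my-pipeline-image python3 /code/run_pipeline.py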

Reasons to use a Dockerfile for data pipeline(s)

  • no surprise updates and breakages due to OS or package updates.
  • easier onboarding of new engineers into the codebase.
  • requirements, configuration, and env vars all become easier to manage.
  • everyone is on the same page, no Windows vs Mac vs Linux gotchas.
  • easier to transition code into distributed environments (think Kubernetes).
  • better DevOps (code deployment) and unit/integration testing (see the sketch after this list).
  • makes you better at the command line (which makes you better in general).
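The testing point is worth showing. A minimal sketch, assuming the repo has a Dockerfile like the one later in this post, a requirements.txt that includes pytest, and a tests/ directory (all assumptions about your repo layout):

docker build --tag my-pipeline-image .
docker run --rm my-pipeline-image python3 -m pytest tests/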

Getting started with Dockerfiles

The first thing to do is install Docker Desktop; it's easy to install and easy to use.
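Once it's installed, a quick sanity check from the terminal confirms everything is wired up (hello-world is the standard test image from Docker Hub, nothing pipeline specific):

docker --version
docker run hello-world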

There are two (probably three) options for writing/using Dockerfiles for data pipelines. First, it's good to understand Docker Hub; it's where pretty much every project under the sun, plus some, stores official images for your use. Need to run Apache Spark? Why install it on your machine when you can get an image with it already installed? Got a Python based project? Why not just use one of the many Python images available?

These pre-built images can be pulled with a simple …

docker pull python  # or whatever else
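From there you can take the image for a quick spin; this one-liner (using the official python image pulled above) just prints the interpreter version baked into it:

docker run --rm python python --version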

The other option is to build your own Dockerfile, based on whatever OS you want, with whatever packages and tools you need … even layered on top of someone else's image.

Let's take the example of someone who builds pipelines that run in AWS on Linux based images. You want a good development base that is as close as possible to, or exactly like, production, correct? So you build a Dockerfile that has, say, Python and Spark on Linux, with the AWS CLI installed.

# start from a plain Ubuntu base image
FROM ubuntu:18.04

# OS packages: Java and Scala for Spark, Python 3.8 + pip, and common build/SSL libs
RUN apt-get update && \
    apt-get install -y default-jdk scala wget vim software-properties-common python3.8 python3-pip curl unzip libpq-dev build-essential libssl-dev libffi-dev

# download Spark 3.0.1 and unpack it to /spark
RUN wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz && \
    tar xvf spark-3.0.1-bin-hadoop3.2.tgz && \
    mv spark-3.0.1-bin-hadoop3.2/ /spark && \
    ln -s /spark spark

# install the AWS CLI v2
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install

# copy the pipeline code into the image and install its Python requirements
WORKDIR /code
COPY . /code

RUN pip3 install -r requirements.txt

ENV MY_CODE=/code

It's just an example, but you get the point: defining a complex set of tools that won't easily be broken, and that all developers and users of the pipeline can share, is a very simple and powerful way to make development, testing, and code usage easy for everyone.

Usually a Dockerfile written like this and stored with the code can be built using a simple command…

docker build --tag my-special-image .
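Once built, it's easy to drop into the image and poke around; the tag here is just the one from the build command above:

docker run --rm -it my-special-image /bin/bash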

Also, make sure to read up on docker-compose. It's a great way to automate running tests and bits of code.
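For flavor, here is a minimal sketch of what a docker-compose.yml for a setup like this might look like. The service names, the volume mount, and the Postgres container are all assumptions; swap in whatever database your pipelines actually talk to:

version: "3.8"
services:
  pipeline:
    build: .             # build from the Dockerfile in this repo
    volumes:
      - .:/code          # mount local code so edits show up without rebuilding
    depends_on:
      - db
  db:
    image: postgres:13   # throwaway database for integration tests (an assumption)
    environment:
      POSTGRES_PASSWORD: example

With that in place, docker-compose up brings everything up, and something like docker-compose run pipeline python3 -m pytest tests/ runs the tests against it.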

Musings

Dockerfiles are far from rocket science; they are probably one of the easiest things to learn, even as a new developer. Like anything else, they can get complicated when running multiple services, but basic usage of a Dockerfile will give you 80% of what you need up front.

I also believe Dockerfiles in general force a more rigid development structure that is missing from a lot of data engineering codebases. When you find Dockerfiles, you are more likely to find unit tests, documentation, requirements files, and generally better design patterns.
