
Deploying Apache Airflow inside Kubernetes.

Has anyone else noticed how popular Apache Airflow and Kubernetes have become lately? There is no better tool than Airflow for Data Engineers to build approachable and maintainable data pipelines. I mean Python, a nice UI, dependency graphs/DAGs. What more could you want? There is also no better tool than Kubernetes for building scalable, flexible data pipelines and hosting apps. Like a match made in heaven. So why not deploy Airflow onto Kubernetes? This is what you wish your mom would have taught you. It’s actually so easy your mom could probably do it… maybe she did do it and just never told you?

The first thing you will need is a Kubernetes cluster. I recommend Linode; you can easily spin up a cheap cluster with the click of a button. Here is my $20/month cluster from Linode.

I’m going to assume you know something about Apache Airflow; I’ve written about it before. Simply put, an Airflow installation usually consists of a Scheduler, Airflow Workers, a Web Server/UI, and optionally a Database. I’m also going to assume you know something about Kubernetes. But what you may not know is how you can actually deploy something like Airflow inside Kubernetes.

Basics of Deploying Airflow inside Kubernetes.

Referencing the first diagram you saw…. we are going to need the following containers running inside Kubernetes.

  • Postgres Container
  • Postgres Service
  • Airflow Webserver
  • Airflow Scheduler
  • Airflow LoadBalancer Service

There are a few obvious and not-so-obvious things about deploying any application inside Kubernetes if you haven’t done it much before.

  • You deploy PODs inside Kubernetes that will run/host different “pieces” of an application.
  • You use YAML files to describe the system you are trying to/will deploy onto Kubernetes.
  • Kubernetes will require something called a LoadBalancer to accept/ingress/route HTTP/Internet traffic from the outside world into the Kubernetes cluster and eventually to your “application,” if you require such a feature. (If you want to use the Airflow UI, you will need this.)
  • Deployed “applications”/PODs inside Kubernetes also will need a Service to communicate across nodes.
  • PODs/containers running on Kubernetes are just Docker images running some command.
  • You will need something called PersistentVolumes for Airflow to store its DAGs and Logs. (If a POD can come and go, crash, etc., there is information you don’t want to lose. A minimal sketch of one follows this list.)
  • If you expose your application via LoadBalancer to the Internet, there are bad people on the internet, plan accordingly.
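
For completeness on that last point: the Postgres example below just uses an emptyDir volume, and the Airflow deployment later doesn’t mount anything at all, to keep the walkthrough simple. If you wanted DAGs and logs to actually survive POD restarts, a PersistentVolumeClaim is the usual route. Here is a minimal sketch, not wired into anything else in this post; the name, size, and storage class are placeholders you would swap for your own.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: airflow-dags
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi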

Step-By-Step – How to deploy Airflow inside Kubernetes.

All this code is available on my GitHub.

Step 1. Get Apache Airflow Docker image.

The PODs running your Apache Airflow on Kubernetes will need a Docker image. Up until recently that was a pain; you would have to build your own or use Puckel… stinking Puckel (if you know what I’m talking about… you cool). Finally, the Airflow community released an official Docker image. The latest version can be obtained by running…

docker pull apache/airflow
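
One small caveat: that pulls whatever the latest tag points to on Docker Hub at the time you run it. Since the rest of this walkthrough pins its pip constraints to 1.10.10, you may want to pull a matching tag explicitly (assuming that tag is published), so the image doesn’t silently change under you.

docker pull apache/airflow:1.10.10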

We should try to understand this Dockerfile and how it works if we plan to use it to run our Airflow inside Kubernetes.

If you run docker run -it docker.io/apache/airflow /bin/bash you will see an error indicating that the Docker image is expecting a different command argument. Meaning, when this POD runs on Kubernetes and the image is sent the “webserver” or “worker” command, that is what Airflow will run.

airflow command error: argument subcommand: invalid choice: '/bin/bash' (choose from 'backfill', 'list_dag_runs', 'list_tasks', 'clear', 'pause', 'unpause', 'trigger_dag', 'delete_dag', 'show_dag', 'pool', 'variables', 'kerberos', 'render', 'run', 'initdb', 'list_dags', 'dag_state', 'task_failed_deps', 'task_state', 'serve_logs', 'test', 'webserver', 'resetdb', 'upgradedb', 'checkdb', 'shell', 'scheduler', 'worker', 'flower', 'version', 'connections', 'create_user', 'delete_user', 'list_users', 'sync_perm', 'next_execution', 'rotate_fernet_key', 'config', 'info'), see help above.

Another question I had was whether this Airflow image uses the default configs (Airflow controls many setup options via a config file). I would assume that, unless changed, Airflow would use SQLite as the database and not Postgres. Dropping into the image using `shell` seems to confirm this… docker run -it docker.io/apache/airflow shell
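
If you want to see those defaults for yourself, the config subcommand (it shows up in the list of valid choices in that error above) should dump the effective configuration; piping it through grep is just a quick way to spot the database setting. A rough check, assuming the subcommand behaves the way it does in 1.10.x:

docker run --rm docker.io/apache/airflow config | grep sql_alchemy_conn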

So, we have the official Apache Airflow Docker image, and we understand a little about how that image works. It requires certain commands to run our different pieces: airflow scheduler, airflow webserver, and the dang airflow initdb. (That last command sets up the database tables needed.)

Step 2. Deploy Postgres into Kubernetes

So this is the easy part. We need to get Postgres up and running inside Kubernetes. If you do a Google search for “postgres yaml kubernetes,” you will find this is a pretty common requirement and examples are easy to find.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: postgres-airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      deploy: postgres-airflow
  template:
    metadata:
      labels:
        name: postgres-airflow
        deploy: postgres-airflow
    spec:
      restartPolicy: Always
      containers:
        - name: postgres
          image: postgres
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
              protocol: TCP
          volumeMounts:
            - name: dbvol
              mountPath: /var/lib/postgresql/data/pgdata
              subPath: pgdata
          env:
            - name: POSTGRES_USER
              value: airflow
            - name: POSTGRES_PASSWORD
              value: airflow
            - name: POSTGRES_DB
              value: airflow
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POD_IP
              valueFrom: { fieldRef: { fieldPath: status.podIP } }
          livenessProbe:
            initialDelaySeconds: 60
            timeoutSeconds: 5
            failureThreshold: 5
            exec:
              command:
              - /bin/sh
              - -c
              - exec pg_isready --host $POD_IP || if [ "$(psql -qtAc --host $POD_IP 'SELECT pg_is_in_recovery()')" != "f" ]; then exit 0; else exit 1; fi
          readinessProbe:
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 5
            exec:
              command:
              - /bin/sh
              - -c
              - exec pg_isready --host $POD_IP
          resources:
            requests:
              memory: .5Gi
              cpu: .5
      volumes:
        - name: dbvol
          emptyDir: {}

Just a side note. Most people would probably put the USERNAME and PASSWORD for the Postgres instance inside some Kubernetes Secrets. I won’t do that in my example. Also, make note of the containerPort that is being exposed. Once you have saved this YAML file as postgres-airflow.yaml, and have your kubectl connected to your Kubernetes cluster, run this command to deploy the Postgres instance.

kubectl apply -f postgres-airflow.yaml

After running this, you should be able to run kubectl get pods and see your Postgres POD running.
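
Circling back to the Secrets side note: if you did want to keep the credentials out of the YAML, it would look roughly like this. This is only a sketch; the Secret name and keys here are made up, and you would swap the plain value: entries in the Deployment’s env section for secretKeyRef references.

apiVersion: v1
kind: Secret
metadata:
  name: airflow-db-credentials
type: Opaque
stringData:
  POSTGRES_USER: airflow
  POSTGRES_PASSWORD: airflow

Then, in postgres-airflow.yaml:

          env:
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: airflow-db-credentials
                  key: POSTGRES_USER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: airflow-db-credentials
                  key: POSTGRES_PASSWORD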

Step 3. Deploy a Service for Postgres.

For Kubernetes PODs to talk to each other across Nodes, you need something called a Service. It’s network magic that makes sure other PODs can talk to your database/Service. Note the ports referenced in the YAML below; they need to match the containerPort defined in your postgres-airflow.yaml file.

kind: Service
apiVersion: v1
metadata:
  name: postgres-airflow
spec:
  selector:
    name: postgres-airflow
  ports:
  - name: postgres-airflow
    protocol: TCP
    port: 5432
    targetPort: 5432

Again, running kubectl apply -f postgres-service.yaml will deploy the Service. Running kubectl get services should show the new Service.
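
If you want to be sure the Service actually resolves and accepts connections before moving on, one quick way is to spin up a throwaway Postgres client POD and run a trivial query through the Service name. The POD name here is arbitrary, and the credentials are the same plaintext airflow/airflow values from the Deployment:

kubectl run pg-check --rm -it --image=postgres --env="PGPASSWORD=airflow" -- \
  psql -h postgres-airflow -U airflow -d airflow -c "SELECT 1;"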

Step 4. Prepare Postgres database for Airflow.

Thought you were done? Not yet. Right now we have an empty Postgres database running; it needs all the Airflow tables set up. In general, the steps we will follow to do this are…

  • shell into the Postgres database POD on Kubernetes.
  • install pip3 onto that POD.
  • use pip to install the Python airflow package.
  • export an ENVIRONMENT VAR telling Airflow to connect to our Postgres database.
  • run the airflow initdb command, which will set up and initialize the database.

So here we go. First, shell into your Postgres POD using the name found from kubectl get pods. In my case the command was…

kubectl exec --stdin --tty postgres-airflow-5878785456-2knp7 -- /bin/bash

Once into the pod…

Run apt-get update
Run apt-get install python3-pip

Then run the following command once pip3 is installed…

pip3 install \
 apache-airflow[postgres]==1.10.10 \
 --constraint \
        https://raw.githubusercontent.com/apache/airflow/1.10.10/requirements/requirements-python3.7.txt

Next we want to export the Environment Variable that Airflow will recognize, pointing it at our Postgres database.

export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql://airflow:airflow@localhost:5432/airflow

Now you can finally run the command which will set up all the database tables.
airflow initdb

Please note again I’m using the USERNAME and PASSWORD I put into my postgres-airflow.yaml file… airflow and airflow. Those are needed for the above ENV VAR. Also, remember that in a production setup you would want those values set in some Secret store.

Once airflow initdb has been run, you should see a lot of output printed to your STDOUT.
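
A quick way to double-check it worked, while you are still inside the Postgres POD (this assumes the image’s default trust auth on the local socket), is to list the tables and confirm the Airflow ones (dag, dag_run, task_instance, and so on) showed up:

psql -U airflow -d airflow -c "\dt"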

Step 5. Get ready to write some YAML files.

We are getting closer now. The next step is to actually write out the “deployment” YAML file that we will submit to Kubernetes, describing what we want. It will create our Airflow scheduler and webserver. It will make sense when you see it. You will want to make sure you use the same names/namespaces across all your YAML files so everything works.

But first, we need to create a ConfigMap that will hold our database connection string, so our Airflow PODs connect to the Postgres instance we set up instead of the default SQLite database. kubectl apply -f airflow-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-config
data:
  sql_alchemy_conn: "postgresql://airflow:airflow@postgres-airflow:5432/airflow"
  executor: "LocalExecutor"

The most important YAML file will be the Airflow deployment itself. You need two Airflow containers running…

  • scheduler
  • webserver

kubectl apply -f airflow-kubernetes.yaml

kind: Deployment
apiVersion: apps/v1
metadata:
  name: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      deploy: airflow
  template:
    metadata:
      labels:
        deploy: airflow
    spec:
      containers:
      - name: airflow-scheduler
        env:
          - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
            valueFrom:
              configMapKeyRef:
                name: airflow-config           # The ConfigMap this value comes from.
                key: sql_alchemy_conn # The key to fetch.
          - name: AIRFLOW__CORE__EXECUTOR
            valueFrom:
              configMapKeyRef:
                name: airflow-config
                key: executor
        image: docker.io/apache/airflow
        command: ["airflow"]
        args: ["scheduler"]
      - name: airflow-webserver
        env:
          - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
            valueFrom:
              configMapKeyRef:
                name: airflow-config           # The ConfigMap this value comes from.
                key: sql_alchemy_conn # The key to fetch.
          - name: AIRFLOW__CORE__EXECUTOR
            valueFrom:
              configMapKeyRef:
                name: airflow-config
                key: executor
        image: docker.io/apache/airflow
        ports:
        - containerPort: 8080
        command: ["airflow"]
        args: ["webserver"]
      restartPolicy: Always

After completing this step you should be able to see the PODs/containers running. kubectl describe pods
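
Because there are two containers in this POD, kubectl logs needs to be told which one you mean. Something along these lines (the deployment name matches the metadata above) should show the scheduler and the webserver starting up:

kubectl logs deployment/airflow -c airflow-scheduler
kubectl logs deployment/airflow -c airflow-webserver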

Step 6. Deploy a LoadBalancer Service to expose Airflow UI to Internet.

DANGER DANGER! Once this step is complete you will have an Airflow UI that anyone can access. You should probably secure it and control the IPs that can hit it.

This is the LoadBalancer described above that will route Internet traffic (good and bad) into our Kubernetes cluster and point that traffic at port 8080, which will be our Airflow UI.

kubectl apply -f airflow-service.yaml

kind: Service
apiVersion: v1
metadata:
  name: airflow
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    deploy: airflow

kubectl get services airflow

kubectl describe service airflow

You should now be able to reach your Airflow UI via http://{LoadBalancer Ingress}:{Port}.
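
Remember the warning at the top of this step. One low-effort option, if your cloud provider’s load balancer honors it, is to restrict which source IPs the Service will accept via loadBalancerSourceRanges. A sketch of the extra field (the CIDR below is a placeholder for your own IP range):

spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
    - 203.0.113.17/32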

There you have it. You now have Apache Airflow deployed and running inside Kubernetes.