Apache Airflow on GKE

Prerequisites

This article won’t go into much detail about the technology stack we will use. We assume you already have some knowledge of, and access to, the following.

  • Helm — We will use it to deploy Apache Airflow on GKE.
  • Cloud DNS — We will use it to manage the DNS zone and ‘A’ record.
  • Domain — We will need it to point to the Apache Airflow web server.
  • Google Workspace account — This article uses a Google Workspace email for single sign-on into Apache Airflow. You can adapt the steps to other identity providers such as Okta or Auth0.
  • Kubernetes knowledge.

Source Code

The source code for this article is publicly available in the GitHub repository here — https://github.com/its-knowledge-sharing/setup-airflow-gke. I recommend cloning the code and following along as you read this article.
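
If you want to follow along locally, cloning the repository is a one-liner:

    # Clone the article's companion repository and move into it.
    git clone https://github.com/its-knowledge-sharing/setup-airflow-gke.git
    cd setup-airflow-gke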

Create the GKE cluster

I’m assuming you are familiar with Google Cloud Platform (GCP) and already have a GCP account. See the script 01-setup-gke.bash for the details, run it, and wait until the GKE cluster is created.
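
If you just want a feel for what the script does, creating a GKE cluster boils down to a gcloud call along these lines. The project ID, cluster name, region, and machine type below are placeholders for illustration, not the values used in 01-setup-gke.bash.

    # Placeholders: adjust project, region, and node sizing before running.
    gcloud config set project my-gcp-project
    gcloud container clusters create airflow-cluster \
      --region us-central1 \
      --num-nodes 1 \
      --machine-type e2-standard-4
    # Fetch kubeconfig credentials so kubectl and helm can talk to the new cluster.
    gcloud container clusters get-credentials airflow-cluster --region us-central1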

Create public IP address

We need to reserve a static public IP address that will point to the Apache Airflow web server. This IP address will be used later when we create the Ingress on GKE. To reserve it, take a look at the script 02-1-create-external-ip.bash.
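
As a rough sketch (the address name here is just an example; 02-1-create-external-ip.bash is the source of truth), reserving a global static IP looks like this:

    # Reserve a global static IP; GKE external HTTP(S) load balancers use global addresses.
    gcloud compute addresses create airflow-static-ip --global
    # Print the reserved address so we can point the DNS 'A' record at it later.
    gcloud compute addresses describe airflow-static-ip --global --format="value(address)"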

Setup DNS zone and ‘A’ record

We will use the domain “airflow.napbiotec.io” in this article. There are two reasons we need a domain (a sketch of the DNS setup follows the list below).

  • We want the Apache Airflow web interface to serve a CA-signed certificate, and we will let Google issue it for us. This is optional; we could use a self-signed certificate instead, but on GKE it is easier to use certificates signed by Google.
  • We want the Apache Airflow web interface to use Google as its identity provider for login, and I already have a Google Workspace account that can be used to demonstrate the Apache Airflow OpenID feature.
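
Here is a minimal sketch of the Cloud DNS side, assuming a zone named airflow-zone and the static IP reserved earlier (203.0.113.10 is a placeholder; use your real address, and make sure your registrar delegates the domain to the zone’s name servers):

    # Create a public managed zone for the domain (the zone name is an example).
    gcloud dns managed-zones create airflow-zone \
      --dns-name="napbiotec.io." \
      --description="Zone for the Apache Airflow demo"
    # Point airflow.napbiotec.io at the reserved static IP.
    gcloud dns record-sets create airflow.napbiotec.io. \
      --zone=airflow-zone \
      --type=A \
      --ttl=300 \
      --rrdatas=203.0.113.10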

Deploy Apache Airflow to GKE

Now it’s time to deploy Apache Airflow to GKE. The easiest way to deploy it is with Helm. A community-maintained Helm chart for Apache Airflow is available here: https://airflow-helm.github.io/charts. All we need to do is create a Helm values file with our customizations and run a few Helm commands.
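
In outline, and only as a sketch (the release name, namespace, and contents of values.yaml depend on the repository’s scripts), the Helm commands look like this:

    # Add the community Airflow chart repository and refresh the local index.
    helm repo add airflow-stable https://airflow-helm.github.io/charts
    helm repo update
    # Install (or upgrade) Airflow into its own namespace using our custom values file.
    helm upgrade --install airflow airflow-stable/airflow \
      --namespace airflow \
      --create-namespace \
      --values values.yaml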

Create an Ingress to make it publicly accessible

We will create an Ingress to make Apache Airflow publicly accessible from the internet. When we create the Ingress resource on GKE, GCP automatically provisions an HTTPS load balancer for us.
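
On GKE this is typically a ManagedCertificate plus an Ingress that references the reserved static IP by name. The manifest below is an illustration: the resource names are assumptions, and the backend service name and port must match whatever the Helm chart actually created (check with kubectl get svc -n airflow).

    # Hypothetical manifests; adjust names to match your release before applying.
    cat <<'EOF' | kubectl apply --namespace airflow -f -
    apiVersion: networking.gke.io/v1
    kind: ManagedCertificate
    metadata:
      name: airflow-cert
    spec:
      domains:
        - airflow.napbiotec.io
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: airflow-ingress
      annotations:
        kubernetes.io/ingress.global-static-ip-name: airflow-static-ip
        networking.gke.io/managed-certificates: airflow-cert
    spec:
      rules:
        - host: airflow.napbiotec.io
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: airflow-web
                    port:
                      number: 8080
    EOF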

Authentication with Google Workspace accounts

We now want to add a new feature to our Apache Airflow deployment: logging in with Google Workspace accounts. This gives us single sign-on within our organization.
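
Under the hood this is the Flask AppBuilder OAuth configuration in Airflow’s webserver_config.py. The sketch below shows the general shape, assuming you have already created an OAuth client ID and secret in the GCP console and exposed them as the environment variables GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET; how the file is injected into the Helm release is handled by the repository’s values, and the exact keys may vary with your Airflow and Flask AppBuilder versions.

    # Hypothetical webserver_config.py; wire it into the chart the way the repository does.
    cat <<'EOF' > webserver_config.py
    import os
    from flask_appbuilder.security.manager import AUTH_OAUTH

    AUTH_TYPE = AUTH_OAUTH                  # log in through an OAuth provider
    AUTH_USER_REGISTRATION = True           # auto-create users on first login
    AUTH_USER_REGISTRATION_ROLE = "Viewer"  # default role for newly registered users

    OAUTH_PROVIDERS = [
        {
            "name": "google",
            "icon": "fa-google",
            "token_key": "access_token",
            "remote_app": {
                "client_id": os.environ["GOOGLE_CLIENT_ID"],
                "client_secret": os.environ["GOOGLE_CLIENT_SECRET"],
                "api_base_url": "https://www.googleapis.com/oauth2/v2/",
                "client_kwargs": {"scope": "email profile"},
                "request_token_url": None,
                "access_token_url": "https://accounts.google.com/o/oauth2/token",
                "authorize_url": "https://accounts.google.com/o/oauth2/auth",
            },
        }
    ]
    EOF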

Use Git-Sync to deploy DAGs

Another feature we need to set up is Git-Sync, which deploys DAGs (in my case, ETL code for data analytics) to our Apache Airflow. Git-Sync keeps polling the Git repository at a fixed interval; whenever changes are pushed to the repository, it syncs them to Apache Airflow for us. Yes, the idea is similar to GitOps.
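
With the community chart, the git-sync settings live in the values file under dags.gitSync. Treat the snippet below as an illustration only: the repository URL, branch, and interval are placeholders, and key names can differ between chart versions, so check the chart’s values reference.

    # Hypothetical git-sync values, kept in a separate file and merged on top of values.yaml.
    cat <<'EOF' > dags-values.yaml
    dags:
      gitSync:
        enabled: true
        repo: "https://github.com/your-org/your-dags.git"  # placeholder DAG repository
        branch: "main"
        syncWait: 60  # poll the repository roughly every 60 seconds
    EOF
    # Roll the change out to the existing release; later --values files override earlier ones.
    helm upgrade airflow airflow-stable/airflow \
      --namespace airflow \
      --values values.yaml \
      --values dags-values.yaml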

Support

Congratulations if you’ve read the entire article and it helped you solve your issues! You can support me by:

  • Following me.
  • Sharing my articles.
  • Buying me a coffee via the ADA address below if you want.
