Apache Airflow on GKE
Last week I got a request from the data engineers at my company: they need a platform they can use to run ETL code for their data analytics work. After some brainstorming, we agreed to use Apache Airflow for many reasons that I will not explain here. There are plenty of explanations on the internet about the benefits of Apache Airflow; this link is one example.
The next question is: how do we set it up (single VM vs. Kubernetes)? For manageability reasons we decided to deploy it on GKE, and that is why I am writing this article.
In this article we will deploy Apache Airflow to a GKE cluster using Helm. We will then configure it to use a Google account email (not Gmail) for authentication. Furthermore, we will use the Git-Sync feature to automatically sync DAG code from GitHub to our Apache Airflow.
This article won't go into too much detail about the technology stack we will use. We assume that you have some knowledge of, and access to, the following:
- Helm — We will use it to deploy Apache Airflow on GKE.
- Cloud DNS — We will use it to manage the DNS zone and ‘A’ record.
- Domain — We will need it to point to the Apache Airflow web server.
- Google Workspace account — This article uses a Google Workspace email for single sign-on into Apache Airflow. You can adapt this to other identity providers such as Okta or Auth0.
- Kubernetes knowledge.
This article will use the domain “airflow.napbiotec.io” for the demonstration. Napbiotec.io is my family-owned business, so if you’re looking for high quality herbal extracts of Mitragyna speciosa (kratom), please visit us.
The source code related to this article is kept publicly in the GitHub repository here — https://github.com/its-knowledge-sharing/setup-airflow-gke. I recommend cloning the code and reading it alongside this article.
Create the GKE cluster
I’m assuming that you are familiar with Google Cloud Platform (GCP) and already have a GCP account. See the script 01-setup-gke.bash for the details; run it and wait until the GKE cluster is created.
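I haven't reproduced the script here, but a minimal sketch of what 01-setup-gke.bash needs to do looks like the following. The project ID, cluster name, region, and machine type are placeholder assumptions, not the exact values from the repository.

```shell
# Placeholder values; replace with your own project/cluster/region.
PROJECT_ID="my-gcp-project"
CLUSTER_NAME="airflow-cluster"
REGION="asia-southeast1"

# Create a small regional GKE cluster.
gcloud container clusters create "${CLUSTER_NAME}" \
  --project "${PROJECT_ID}" \
  --region "${REGION}" \
  --num-nodes 1 \
  --machine-type e2-standard-4

# Fetch credentials so that kubectl talks to the new cluster.
gcloud container clusters get-credentials "${CLUSTER_NAME}" \
  --project "${PROJECT_ID}" \
  --region "${REGION}"
```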
Create public IP address
We need to create the IP address that will point to the Apache Airflow web server. Later, this IP address will be used when we create the Ingress on GKE. To create the public IP address, please take a look at the script 02-1-create-external-ip.bash.
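As a rough sketch (the exact flags in 02-1-create-external-ip.bash may differ), reserving a named address for a GKE Ingress looks like this:

```shell
# Reserve a global static IP; GKE's external HTTP(S) load balancer
# requires a *global* address, not a regional one.
gcloud compute addresses create ingress-ip-1 --global

# Print the allocated address; we will need it for the DNS "A" record.
gcloud compute addresses describe ingress-ip-1 \
  --global --format="value(address)"
```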
Once the IP address is created, we will see it in the GCP console as shown in the picture below. Please note that we assign the name ingress-ip-1 to the IP address; later we will refer to it by that name rather than by the actual IP address.
Setup DNS zone and ‘A’ record
We will use the domain “airflow.napbiotec.io” in this article. There are two reasons we need a domain:
- We need the Apache Airflow web interface to use a CA-signed certificate, and we will have Google issue the certificate for us. This is actually optional (we could use a self-signed certificate instead), but on GKE it is easier to use certificates signed by Google.
- We need the Apache Airflow web interface to use Google as the identity provider for login, and I already have a Google Workspace account that can be used to demonstrate Apache Airflow’s OpenID Connect feature.
Let’s start by creating the DNS managed zone in Cloud DNS first. The script to create one can be found here: 02-2-create-dns-zone.bash.
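For reference, creating a managed zone for the subdomain can be sketched like this; the zone name airflow-zone is my assumption, not necessarily the one used in the script:

```shell
# Create a Cloud DNS managed zone for the subdomain itself, so that
# its NS records can later be delegated from the parent domain.
gcloud dns managed-zones create airflow-zone \
  --dns-name="airflow.napbiotec.io." \
  --description="Zone for the Apache Airflow web server"
```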
Next, we will create an “A” record for “airflow.napbiotec.io” by running the script 02-3-create-a-record.bash.
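A sketch of that step, assuming the hypothetical zone name airflow-zone; we look up the reserved address by its name instead of hardcoding it:

```shell
# Resolve the reserved address by its name.
IP="$(gcloud compute addresses describe ingress-ip-1 \
  --global --format='value(address)')"

# Create the "A" record pointing the domain at that address.
gcloud dns record-sets create "airflow.napbiotec.io." \
  --zone="airflow-zone" \
  --type="A" \
  --ttl=300 \
  --rrdatas="${IP}"
```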
Once the DNS managed zone and “A” record are created, we will see them in the GCP console as shown below. Note that the IP address of “airflow.napbiotec.io” is the one we created earlier.
In order to make our domain (“airflow.napbiotec.io” in this case) resolvable, we need to copy the NS records of the new zone and add them as an NS record in the parent zone (“napbiotec.io” in this case).
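To find the name servers to copy, you can list the zone’s NS records (again assuming the zone name airflow-zone):

```shell
# These name servers must be added as an NS record for "airflow"
# in the parent zone (napbiotec.io).
gcloud dns record-sets list --zone="airflow-zone" --type="NS"
```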
Deploy Apache Airflow to GKE
Now it’s time to deploy Apache Airflow to GKE. The easiest way to deploy it is with Helm. There is a community-maintained Helm chart for Apache Airflow here: https://airflow-helm.github.io/charts. What we need to do is create a Helm values file to customize what we need, and then run a few Helm commands to get everything up.
The script to deploy Apache Airflow can be found here: 03-1-deploy-airflow.bash.
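In essence, the deployment boils down to something like the following; the release name, namespace, and values file name are assumptions on my part:

```shell
# Register the community Airflow chart repository.
helm repo add airflow-stable https://airflow-helm.github.io/charts
helm repo update

# Install (or upgrade) Airflow with our customized values.
helm upgrade --install airflow airflow-stable/airflow \
  --namespace airflow \
  --create-namespace \
  --values values.yaml
```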
Once Apache Airflow is successfully deployed, we will see Pods and Services similar to those shown in the pictures below.
Create an Ingress to make it publicly accessible
We will create an Ingress to make Apache Airflow publicly accessible from the internet. When we create the Ingress resource on GKE, GCP will automatically provision an HTTPS load balancer for us.
Please take a look at the script 03-2-deploy-ingress.bash to see how we create the Ingress. It also shows how we use a ManagedCertificate resource to request that Google sign the certificate for us.
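The two resources involved look roughly like the sketch below; the Service name and port are assumptions and may differ in your release:

```yaml
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: airflow-cert
  namespace: airflow
spec:
  domains:
    - airflow.napbiotec.io
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airflow-ingress
  namespace: airflow
  annotations:
    # Refer to the reserved static IP by its name.
    kubernetes.io/ingress.global-static-ip-name: ingress-ip-1
    # Attach the Google-managed certificate declared above.
    networking.gke.io/managed-certificates: airflow-cert
spec:
  defaultBackend:
    service:
      name: airflow-web   # assumed name of the webserver Service
      port:
        number: 8080
```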
Please make sure that the domain resolves to the IP address created earlier; otherwise Google will not approve the submitted certificate signing request.
Once the Ingress is created, we will see a result similar to the picture shown below.
The GCP HTTPS load balancer should be created automatically.
If everything is OK, we should be able to navigate to https://airflow.napbiotec.io/ (I will remove this link very soon) and the login page should be shown.
The issued certificate is also trusted by browsers.
Try logging in with the user “admin” and password “admin”; these are the default credentials. Later we will configure authentication to use Google Workspace accounts instead.
Authentication with Google Workspace accounts
We now want to add a new feature to our Apache Airflow: logging in with Google Workspace accounts. This will give us single sign-on within our organization.
We first need to create an OAuth2 Client ID in the GCP console; I won’t go into too much detail on this. You can actually use another identity provider such as Okta or Auth0. All we really need are the Client ID and Client Secret, but keep in mind that you will need to slightly change the configuration in the file airflow-2-google-oidc.yaml to match your identity provider.
The picture below shows the OAuth2 Client ID that will be used in this article (I will remove it from my GCP project soon). The Client ID and Client Secret will be used to create a Kubernetes Secret, which in turn will be used by Apache Airflow.
Now let’s use the script 04-1-create-oidc-secret.bash to create the Kubernetes Secret that holds the Client ID and Client Secret.
Note that we need to manually populate the two environment variables OIDC_CLIENT_ID and OIDC_CLIENT_SECRET before running the script; I don’t want to hardcode them in the source code.
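With those variables set, the Secret creation is essentially a single kubectl command; the key names client_id and client_secret are my assumptions and must match whatever the values file references:

```shell
# OIDC_CLIENT_ID and OIDC_CLIENT_SECRET must be exported beforehand,
# e.g.: export OIDC_CLIENT_ID="..." OIDC_CLIENT_SECRET="..."
kubectl create secret generic google-oidc \
  --namespace airflow \
  --from-literal=client_id="${OIDC_CLIENT_ID}" \
  --from-literal=client_secret="${OIDC_CLIENT_SECRET}"
```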
We should see the Secret google-oidc created as shown in the picture below.
We could also use External Secrets and Google Secret Manager to keep our secrets; this would make our deployment more automated. Please see more details in my other article here.
Now run the script 04-2-update-airflow-oidc.bash to redeploy Apache Airflow with the new configuration that supports authentication with Google. All the script does is call Helm and pass the new values to it; the new Helm values file can be found here: airflow-2-google-oidc.yaml.
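For orientation, such a values file typically wires the Secret into the webserver environment and overrides Airflow’s Flask AppBuilder webserver_config.py with an OAuth section. The sketch below follows that pattern; the exact value keys depend on the chart version, so treat it as an illustration rather than the exact contents of airflow-2-google-oidc.yaml:

```yaml
airflow:
  extraEnv:   # assumed way to surface the Secret to the webserver
    - name: OIDC_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: google-oidc
          key: client_id
    - name: OIDC_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: google-oidc
          key: client_secret
web:
  webserverConfig:
    # Override webserver_config.py to enable OAuth login.
    stringOverride: |
      import os
      from flask_appbuilder.security.manager import AUTH_OAUTH

      AUTH_TYPE = AUTH_OAUTH
      AUTH_USER_REGISTRATION = True
      AUTH_USER_REGISTRATION_ROLE = "Admin"

      OAUTH_PROVIDERS = [
          {
              "name": "google",
              "icon": "fa-google",
              "token_key": "access_token",
              "remote_app": {
                  "client_id": os.environ["OIDC_CLIENT_ID"],
                  "client_secret": os.environ["OIDC_CLIENT_SECRET"],
                  "api_base_url": "https://www.googleapis.com/oauth2/v2/",
                  "client_kwargs": {"scope": "email profile"},
                  "request_token_url": None,
                  "access_token_url": "https://accounts.google.com/o/oauth2/token",
                  "authorize_url": "https://accounts.google.com/o/oauth2/auth",
              },
          }
      ]
```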
Wait no longer than 5 minutes until all the Pods’ status is “Running”, then navigate your browser to https://airflow.napbiotec.io/. This time your browser will redirect to the Google login page first, as shown in the picture below.
Once we’re done with the Google authentication process, we will be redirected back to our Apache Airflow landing page.
Note that there is an “Access is Denied” error message. I don’t really know its root cause; it might be a tiny bug. Just refresh your browser and the error message will disappear.
Use Git-Sync to deploy DAGs
Another feature we need to set up is Git-Sync, which we will use to deploy DAGs (in my case, ETL code for data analytics) to our Apache Airflow. Git-Sync keeps polling the Git repository at a specified interval; if changes are pushed to the repository, Git-Sync syncs those changes to Apache Airflow for us. The idea is similar to GitOps.
Please run the script 04-3-update-airflow-gitsync.bash to redeploy Apache Airflow with the new configuration that enables the Git-Sync feature. All the script does is call Helm and pass the new values to it; the new Helm values file can be found here: airflow-3-gitsync-public.yaml.
If everything is OK, we will see Pods as shown in the picture below. Note the “2/2” in the READY column of some Pods; this indicates that a Git-Sync sidecar container has been added to those Pods.
The snippet below covers only the configuration related to Git-Sync. In this article we use a public GitHub repository with DAG code from here. When syncing from a public Git repository, I recommend using the https scheme instead of ssh.
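A sketch of that Git-Sync configuration under the community chart’s value layout follows; the repository URL is a placeholder, not the actual DAG repository, and the exact keys may vary with the chart version:

```yaml
dags:
  gitSync:
    enabled: true
    # Placeholder URL; point this at your own public DAG repository.
    repo: "https://github.com/your-org/your-dags.git"
    branch: "main"
    revision: "HEAD"
    # Poll the repository for new commits every 60 seconds.
    syncWait: 60
```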
Using Git-Sync with a private Git repository is out of scope for this article, but it is not too difficult to configure. If you really need me to demonstrate how, please leave a comment.
Take a look at the Apache Airflow web interface again; we will see that there are many DAGs there.
Congratulations if you’ve read the entire article and it helped you solve your issues! You can support me by:
- Following me.
- Sharing my articles.
- Buying me a coffee via the ADA address below if you want.