"Lugvloei" is Afrikaans for "Airflow." I chose Afrikaans randomly to make the repository name unique. This project serves both as my learning documentation and as a playground for experimentation.
The project sets up a local data pipeline system orchestrated with Apache Airflow (referred to as Airflow in this documentation) that runs inside a Kubernetes in Docker (kind) cluster.
- Build & Deployment
  - The kind cluster runs inside a Docker container.
  - The custom Airflow image is built, tagged, and then pushed to `kind-registry` so that the kind cluster can pull the image.
  - Helm manages the installation and configuration of all applications inside the kind cluster.
- Orchestration
  - Airflow runs inside the kind cluster.
  - Airflow uses PostgreSQL as the metadata and result-backend database, as defined in `airflow.yaml` (this is not the PostgreSQL instance shown in the High-Level Architecture). Read more about it here.
  - Airflow uses git-sync to sync Directed Acyclic Graph (DAG) files from GitHub. Read more about it here.
- Data Pipeline
  - The data pipeline extracts data from a PostgreSQL database that runs inside the kind cluster, uploads the data to Google Cloud Storage (GCS) as a JSON file, and then loads it into BigQuery. These steps are implemented in Python; a minimal sketch of such a DAG follows this list.
- Logging
  - Airflow stores the DAG logs in GCS.
  - Airflow sends notifications of DAG completion and failure to Slack.
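
To make the pipeline shape concrete, here is a minimal, hypothetical DAG sketch. It is not the actual code in this repository; it assumes the Airflow Postgres and Google providers are installed, and the connection IDs, bucket, and table names are placeholders that the setup steps below will create.

```python
# Hypothetical sketch only; the real DAGs in this repository may differ.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_users_to_gcs():
    # Pull rows from the source PostgreSQL database running in the cluster.
    pg = PostgresHook(postgres_conn_id="pg_lugvloei")
    conn = pg.get_conn()
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM public.users")
        columns = [col[0] for col in cur.description]
        records = [dict(zip(columns, row)) for row in cur.fetchall()]

    # Upload the rows to GCS as newline-delimited JSON.
    payload = "\n".join(json.dumps(record, default=str) for record in records)
    gcs = GCSHook(gcp_conn_id="google_cloud_default")
    gcs.upload(bucket_name="<your-bucket-name>", object_name="users/users.json", data=payload)


with DAG(
    dag_id="pg_lugvloei_users_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_users_to_gcs", python_callable=extract_users_to_gcs)

    load = GCSToBigQueryOperator(
        task_id="load_users_to_bigquery",
        bucket="<your-bucket-name>",
        source_objects=["users/users.json"],
        destination_project_dataset_table="lugvloei.users",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
        gcp_conn_id="google_cloud_default",
    )

    extract >> load
```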
- Docker (v27.4.0)
- Helm (v3.17.0)
- Personal Google Cloud Platform (GCP) project
- kind (v0.26.0)
- kubectl (v1.32.1)
- GNU Make (v3.81)
- Python (v3.11)
- Fork this repository, then clone the forked repository to your device and open it using your favorite IDE.
- Create a `.env` file from the `.env.template`. You can use the example values for `CLUSTER_NAME`, `AIRFLOW_FERNET_KEY`, and `AIRFLOW_WEBSERVER_SECRET_KEY`. But if you want your own keys, you can generate them using this guide for `AIRFLOW_FERNET_KEY` and this guide for `AIRFLOW_WEBSERVER_SECRET_KEY`.
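
  If you prefer to generate both keys locally, here is a minimal sketch (it assumes the `cryptography` package is available, which it is in the Airflow virtual environment described below):

  ```python
  # Generate values for AIRFLOW_FERNET_KEY and AIRFLOW_WEBSERVER_SECRET_KEY.
  import secrets

  from cryptography.fernet import Fernet

  # Fernet key: used by Airflow to encrypt connection passwords and variables.
  print("AIRFLOW_FERNET_KEY:", Fernet.generate_key().decode())

  # Webserver secret key: any random hex string works; it signs session cookies.
  print("AIRFLOW_WEBSERVER_SECRET_KEY:", secrets.token_hex(16))
  ```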
- Create a GCS bucket, then replace the `<your-bucket-name>` placeholder in the `AIRFLOW_REMOTE_BASE_LOG_FOLDER` value in the `.env` file with the name of the bucket you created.
- Create a GCP service account that has read and write access to GCS and BigQuery, and save the service account key as `serviceaccount.json` in the `files/` directory.
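
  A minimal, hypothetical sanity check that the key can reach both services (it assumes the `google-cloud-storage` and `google-cloud-bigquery` packages are installed and that the roles you granted allow listing):

  ```python
  # Verify that files/serviceaccount.json authenticates against GCS and BigQuery.
  from google.cloud import bigquery, storage

  storage_client = storage.Client.from_service_account_json("files/serviceaccount.json")
  bigquery_client = bigquery.Client.from_service_account_json("files/serviceaccount.json")

  # List a few objects in the log bucket you created in the previous step.
  print(list(storage_client.list_blobs("<your-bucket-name>", max_results=5)))

  # List the datasets the service account can see.
  print([dataset.dataset_id for dataset in bigquery_client.list_datasets()])
  ```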
- Update the `<your-github-username>` placeholder in the `AIRFLOW_DAGS_GIT_SYNC_REPO` value in the `.env` file to your GitHub username, and make sure you don't skip Step 1!
- (Optional) To make the Airflow dependencies available on your local device, execute the following scripts.

  ```shell
  # Create Python virtual environment
  python -m venv venv

  # Activate the virtual environment
  source venv/bin/activate

  # Install base Airflow 2.9.3 with Python 3.11 dependencies
  pip install "apache-airflow==2.9.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.11.txt"

  # Install additional dependencies
  pip install -r airflow.requirements.txt
  ```
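
  After the installation finishes, a quick import check confirms the pinned version is available in the virtual environment:

  ```python
  # Run inside the activated venv; expect "2.9.3".
  import airflow

  print(airflow.__version__)
  ```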
- (Recommended) Adjust your Docker memory limit: set it to at least 8 GB to avoid failures while installing the kind cluster.
- Fill in or use the default values for `POSTGRESQL_AUTH_DATABASE`, `POSTGRESQL_AUTH_USERNAME`, and `POSTGRESQL_AUTH_PASSWORD` in the `.env` file.
- (Optional) Install a database manager of your choice. FYI, I'm using Beekeeper Studio as I write this documentation.
- Provision the cluster.

  ```shell
  make install
  ```

  The following is the expected result.

  ```text
  Creating cluster "kind" ...
   ✓ Ensuring node image (kindest/node:v1.32.0) 🖼
   ✓ Preparing nodes 📦 📦 📦
   ✓ Writing configuration 📜
   ✓ Starting control-plane 🕹️
   ✓ Installing CNI 🔌
   ✓ Installing StorageClass 💾
   ✓ Joining worker nodes 🚜
  Set kubectl context to "kind-kind"
  You can now use your cluster with:

  kubectl cluster-info --context kind-kind

  Thanks for using kind! 😊
  configmap/local-registry-hosting created
  namespace/airflow created
  secret/airflow-gcp-sa create
  ```
- Build, tag, and push the Airflow image to the `kind-registry`.

  ```shell
  make airflow-build
  ```
- Install Airflow in the cluster.

  ```shell
  make airflow-install
  ```

  Check the pods.

  ```shell
  kubectl get pods -n airflow --watch
  ```

  ⏳ Wait until the Airflow Webserver pod status changes to Running, then continue to the next step. The following is the expected result.

  ```text
  NAME                                 READY   STATUS    RESTARTS   AGE
  airflow-postgresql-0                 1/1     Running   0          3m23s
  airflow-redis-0                      1/1     Running   0          3m23s
  airflow-scheduler-556555fd95-7tnnn   3/3     Running   0          3m23s
  airflow-statsd-d76fb476b-zv4ms       1/1     Running   0          3m23s
  airflow-triggerer-0                  3/3     Running   0          3m23s
  airflow-webserver-78d4758d7-jnhzl    1/1     Running   0          3m23s
  airflow-worker-0                     3/3     Running   0          3m23s
  ```
- Port-forward the Airflow Webserver to your local machine so you can open the Airflow Webserver UI in your browser.

  ```shell
  make airflow-webserver-pf
  ```

  Go to http://localhost:8080/ to check the Airflow Webserver UI. Try to log in using `admin:admin` if you didn't change the default credentials.

  You should see this page after login.
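
  If you prefer to verify from code instead of the browser, here is a small hypothetical check against the webserver's health endpoint (it assumes the `requests` package is installed and the port-forward is active):

  ```python
  # Probe the Airflow webserver health endpoint through the port-forward.
  import requests

  response = requests.get("http://localhost:8080/health", timeout=10)
  # The metadatabase and scheduler components should both report "healthy".
  print(response.json())
  ```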
- Ensure you have already filled in all the PostgreSQL-related variable values in the `.env` file.
- Install PostgreSQL in the cluster.

  ```shell
  make pg-install
  ```

  Check the pods.

  ```shell
  kubectl get pods -n postgresql --watch
  ```

  ⏳ Wait until the PostgreSQL pod status changes to Running, then continue to the next step. The following is the expected result.

  ```text
  NAME              READY   STATUS    RESTARTS   AGE
  postgresql-db-0   1/1     Running   0          3m39s
  ```
- Port-forward the PostgreSQL database to your local machine so you can open the database using your favorite database manager.

  ```shell
  make pg-pf
  ```

  The following is the expected result.

  ```text
  kubectl port-forward svc/postgresql-db 5432:5432 --namespace postgresql
  Forwarding from 127.0.0.1:5432 -> 5432
  Forwarding from [::1]:5432 -> 5432
  ```
- Connect to the PostgreSQL database using your preferred method. Fill in the connection details with the values you used in step 8 of Environment Setup, then click the Test button. If you are also using Beekeeper Studio, it will look like this.

  If the connection looks good, click the Connect button.
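
  If you'd rather test the connection from code than from a GUI client, here is a minimal sketch using `psycopg2` and the default values from the `.env.template` (adjust them if you changed yours):

  ```python
  # Connect through the port-forward and print the server version.
  import psycopg2

  connection = psycopg2.connect(
      host="localhost",
      port=5432,
      dbname="lugvloei",
      user="postgres",
      password="postgres",
  )
  with connection.cursor() as cursor:
      cursor.execute("SELECT version();")
      print(cursor.fetchone()[0])
  connection.close()
  ```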
- Copy and paste the query in PostgreSQL-DDL into the query window, and run it to create two tables in the public schema and populate each of them with dummy data.
- Open the Airflow Webserver UI, hover over the Admin dropdown at the top of the UI, then click Connections.
- If you are using the default values from the `.env.template` for your `.env` values, just add the connection details below; otherwise, adjust the connection details to match your `.env` values.

  ```text
  Connection Id: pg_lugvloei
  Connection Type: Postgres
  # The format for the host is <svc>.<namespace>.svc.cluster.local
  # To get the svc name, you can run `kubectl get svc -n postgresql`
  # You actually got the details previously when you ran `make pg-install`
  Host: postgresql-db.postgresql.svc.cluster.local
  Database: lugvloei
  Login: postgres
  Password: postgres
  Port: 5432
  ```
- Click the Test 🚀 button. You should see a green light above the connection details with the Connection successfully tested text.
- Create a GCS bucket, then create a `GCS_DATA_LAKE_BUCKET` variable in Airflow, and use the name of the bucket you created as the value of the variable.
- Create a `google_cloud_default` connection in Airflow and fill in only the Project Id and Keyfile Path. Then create a `GCP_CONN_ID` variable and set `google_cloud_default` as its value. If you didn't change the default values related to the service account in provision.sh or airflow.yaml, you can use `/var/secrets/airflow-gcp-sa/serviceaccount.json` as the Keyfile Path value.
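
  Here is a hypothetical way to confirm that the variable and connection resolve correctly, for example from a throwaway test task or a Python session inside a worker pod (the names follow the steps above):

  ```python
  # Resolve the Airflow variable and connection, then list the bucket contents.
  from airflow.models import Variable
  from airflow.providers.google.cloud.hooks.gcs import GCSHook

  bucket = Variable.get("GCS_DATA_LAKE_BUCKET")
  gcs_hook = GCSHook(gcp_conn_id=Variable.get("GCP_CONN_ID"))

  # An empty list is fine here; an auth error means the connection is misconfigured.
  print(gcs_hook.list(bucket_name=bucket))
  ```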
- Create a new dataset in BigQuery and name it `lugvloei`. Choose the same region as the bucket you created previously. You can do this in the console, or from code as sketched below.
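
  A minimal sketch using the `google-cloud-bigquery` client (the location below is only an example; use your bucket's region):

  ```python
  # Create the lugvloei dataset with the service account key from the setup steps.
  from google.cloud import bigquery

  client = bigquery.Client.from_service_account_json("files/serviceaccount.json")

  dataset = bigquery.Dataset(f"{client.project}.lugvloei")
  dataset.location = "asia-southeast1"  # example region; match your GCS bucket

  client.create_dataset(dataset, exists_ok=True)
  print(f"Created dataset {client.project}.lugvloei")
  ```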
- Enable the `pg_lugvloei_orders` and `pg_lugvloei_users` DAGs in the Airflow Webserver UI.

  You can check the logs and other details by clicking the DAG name.
- Check the results in your GCS bucket and BigQuery dataset, for example with the quick query sketched below.
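
  A hypothetical spot check from code (the table names are assumptions based on the DAG names; adjust them to whatever tables the DAGs actually create):

  ```python
  # Count the rows the pipeline loaded into BigQuery.
  from google.cloud import bigquery

  client = bigquery.Client.from_service_account_json("files/serviceaccount.json")

  for table in ("users", "orders"):
      query = f"SELECT COUNT(*) AS row_count FROM `{client.project}.lugvloei.{table}`"
      for row in client.query(query).result():
          print(table, row.row_count)
  ```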
Akh, there you go!