"Lugvloei" is Afrikaans for "Airflow." I chose Afrikaans randomly to make the repository name unique. This project serves both as my learning documentation and as a playground for experimentation.
The project sets up a local data pipeline system orchestrated with Apache Airflow (referred to as Airflow in this documentation) that runs inside a Kubernetes in Docker (kind) cluster.
- Build & Deployment
  - The kind cluster runs inside a Docker container.
  - The custom Airflow image is built, tagged, and then pushed to `kind-registry` so that the kind cluster can pull the image.
  - Helm manages the installation and configuration of all applications inside the kind cluster.
- Orchestration
  - Airflow runs inside the kind cluster.
  - Airflow uses PostgreSQL as the metadata and result-backend database, as defined in `airflow.yaml` (this is not the PostgreSQL instance shown in the High-Level Architecture). Read more about it here.
  - Airflow uses git-sync to sync Directed Acyclic Graph (DAG) files from GitHub. Read more about it here.
- Data Pipeline
  - The data pipeline extracts data from a PostgreSQL database that runs inside the kind cluster, uploads the data to Google Cloud Storage (GCS) as a JSON file, and then loads it into BigQuery. These steps are implemented in Python; a minimal sketch of such a DAG follows this list.
- Logging
  - Airflow stores the DAG logs in GCS.
  - Airflow sends notifications of DAG completion and failure to Slack.
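
To make the pipeline shape concrete, here is a minimal, hypothetical DAG sketch. It is not the actual code in this repository; it assumes the Airflow Postgres and Google providers are installed, and the connection IDs, bucket, and table names are placeholders that the setup steps below will create.

```python
# Hypothetical sketch only; the real DAGs in this repository may differ.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_users_to_gcs():
    # Pull rows from the source PostgreSQL database running in the cluster.
    pg = PostgresHook(postgres_conn_id="pg_lugvloei")
    conn = pg.get_conn()
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM public.users")
        columns = [col[0] for col in cur.description]
        records = [dict(zip(columns, row)) for row in cur.fetchall()]

    # Upload the rows to GCS as newline-delimited JSON.
    payload = "\n".join(json.dumps(record, default=str) for record in records)
    gcs = GCSHook(gcp_conn_id="google_cloud_default")
    gcs.upload(bucket_name="<your-bucket-name>", object_name="users/users.json", data=payload)


with DAG(
    dag_id="pg_lugvloei_users_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_users_to_gcs", python_callable=extract_users_to_gcs)

    load = GCSToBigQueryOperator(
        task_id="load_users_to_bigquery",
        bucket="<your-bucket-name>",
        source_objects=["users/users.json"],
        destination_project_dataset_table="lugvloei.users",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
        gcp_conn_id="google_cloud_default",
    )

    extract >> load
```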
- Docker (v27.4.0)
- Helm (v3.17.0)
- Personal Google Cloud Platform (GCP) project
- kind (v0.26.0)
- kubectl (v1.32.1)
- GNU Make (v3.81)
- Python (v3.11)
- Fork this repository, then clone the forked repository to your device and open it using your favorite IDE.
- Create a `.env` file from the `.env.template`. You can use the example values for `CLUSTER_NAME`, `AIRFLOW_FERNET_KEY`, and `AIRFLOW_WEBSERVER_SECRET_KEY`. But if you want your own keys, you can generate them using this guide for `AIRFLOW_FERNET_KEY` and this guide for `AIRFLOW_WEBSERVER_SECRET_KEY`.
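
  If you prefer to generate both keys locally, here is a minimal sketch (it assumes the `cryptography` package is available, which it is in the Airflow virtual environment described below):

  ```python
  # Generate values for AIRFLOW_FERNET_KEY and AIRFLOW_WEBSERVER_SECRET_KEY.
  import secrets

  from cryptography.fernet import Fernet

  # Fernet key: used by Airflow to encrypt connection passwords and variables.
  print("AIRFLOW_FERNET_KEY:", Fernet.generate_key().decode())

  # Webserver secret key: any random hex string works; it signs session cookies.
  print("AIRFLOW_WEBSERVER_SECRET_KEY:", secrets.token_hex(16))
  ```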
- Create a GCS bucket, then replace the `<your-bucket-name>` placeholder in the `AIRFLOW_REMOTE_BASE_LOG_FOLDER` value in the `.env` file with the name of the bucket you created.
- Create a GCP service account that has read and write access to GCS and BigQuery, and save the service account key as `serviceaccount.json` in the `files/` directory.
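
  A minimal, hypothetical sanity check that the key can reach both services (it assumes the `google-cloud-storage` and `google-cloud-bigquery` packages are installed and that the roles you granted allow listing):

  ```python
  # Verify that files/serviceaccount.json authenticates against GCS and BigQuery.
  from google.cloud import bigquery, storage

  storage_client = storage.Client.from_service_account_json("files/serviceaccount.json")
  bigquery_client = bigquery.Client.from_service_account_json("files/serviceaccount.json")

  # List a few objects in the log bucket you created in the previous step.
  print(list(storage_client.list_blobs("<your-bucket-name>", max_results=5)))

  # List the datasets the service account can see.
  print([dataset.dataset_id for dataset in bigquery_client.list_datasets()])
  ```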
- Update the `<your-github-username>` placeholder in the `AIRFLOW_DAGS_GIT_SYNC_REPO` value in the `.env` file to your GitHub username, and make sure you don't skip Step 1!
- (Optional) To make the Airflow dependencies available on your local device, execute the following scripts.

  ```shell
  # Create Python virtual environment
  python -m venv venv

  # Activate the virtual environment
  source venv/bin/activate

  # Install base Airflow 2.9.3 with Python 3.11 dependencies
  pip install "apache-airflow==2.9.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.11.txt"

  # Install additional dependencies
  pip install -r airflow.requirements.txt
  ```
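
  After the installation finishes, a quick import check confirms the pinned version is available in the virtual environment:

  ```python
  # Run inside the activated venv; expect "2.9.3".
  import airflow

  print(airflow.__version__)
  ```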
- (Recommended) Adjust your Docker memory limit: set it to at least 8 GB to avoid failures while installing the kind cluster.
- Fill in or use the default values for `POSTGRESQL_AUTH_DATABASE`, `POSTGRESQL_AUTH_USERNAME`, and `POSTGRESQL_AUTH_PASSWORD` in the `.env` file.
- (Optional) Install a database manager of your choice. FYI, I'm using Beekeeper Studio as I write this documentation.
- Provision the cluster.

  ```shell
  make install
  ```

  The following is the expected result.

  ```text
  Creating cluster "kind" ...
   ✓ Ensuring node image (kindest/node:v1.32.0) 🖼
   ✓ Preparing nodes 📦 📦 📦
   ✓ Writing configuration 📜
   ✓ Starting control-plane 🕹️
   ✓ Installing CNI 🔌
   ✓ Installing StorageClass 💾
   ✓ Joining worker nodes 🚜
  Set kubectl context to "kind-kind"
  You can now use your cluster with:

  kubectl cluster-info --context kind-kind

  Thanks for using kind! 😊
  configmap/local-registry-hosting created
  namespace/airflow created
  secret/airflow-gcp-sa create
  ```
- Build, tag, and push the Airflow image to the `kind-registry`.

  ```shell
  make airflow-build
  ```
- Install Airflow in the cluster.

  ```shell
  make airflow-install
  ```

  Check the pods.

  ```shell
  kubectl get pods -n airflow --watch
  ```

  ⏳ Wait until the Airflow Webserver pod status changes to Running, then continue to the next step. The following is the expected result.

  ```text
  NAME                                 READY   STATUS    RESTARTS   AGE
  airflow-postgresql-0                 1/1     Running   0          3m23s
  airflow-redis-0                      1/1     Running   0          3m23s
  airflow-scheduler-556555fd95-7tnnn   3/3     Running   0          3m23s
  airflow-statsd-d76fb476b-zv4ms       1/1     Running   0          3m23s
  airflow-triggerer-0                  3/3     Running   0          3m23s
  airflow-webserver-78d4758d7-jnhzl    1/1     Running   0          3m23s
  airflow-worker-0                     3/3     Running   0          3m23s
  ```
- Port-forward the Airflow Webserver to your local machine so you can open the Airflow Webserver UI in your browser.

  ```shell
  make airflow-webserver-pf
  ```

  Go to http://localhost:8080/ to check the Airflow Webserver UI. Try to log in using `admin:admin` if you didn't change the default credentials.

  You should see this page after login.
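
  If you prefer to verify from code instead of the browser, here is a small hypothetical check against the webserver's health endpoint (it assumes the `requests` package is installed and the port-forward is active):

  ```python
  # Probe the Airflow webserver health endpoint through the port-forward.
  import requests

  response = requests.get("http://localhost:8080/health", timeout=10)
  # The metadatabase and scheduler components should both report "healthy".
  print(response.json())
  ```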
- Ensure you have already filled in all the PostgreSQL-related variable values in the `.env` file.
- Install PostgreSQL in the cluster.

  ```shell
  make pg-install
  ```

  Check the pods.

  ```shell
  kubectl get pods -n postgresql --watch
  ```

  ⏳ Wait until the PostgreSQL pod status changes to Running, then continue to the next step. The following is the expected result.

  ```text
  NAME              READY   STATUS    RESTARTS   AGE
  postgresql-db-0   1/1     Running   0          3m39s
  ```
- Port-forward the PostgreSQL database to your local machine so you can open the database using your favorite database manager.

  ```shell
  make pg-pf
  ```

  The following is the expected result.

  ```text
  kubectl port-forward svc/postgresql-db 5432:5432 --namespace postgresql
  Forwarding from 127.0.0.1:5432 -> 5432
  Forwarding from [::1]:5432 -> 5432
  ```
- Connect to the PostgreSQL database using your preferred method. Fill in the connection details with the values you used in step 8 of Environment Setup, then click the Test button. If you are also using Beekeeper Studio, it will look like this.

  If the connection looks good, click the Connect button.
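
  If you'd rather test the connection from code than from a GUI client, here is a minimal sketch using `psycopg2` and the default values from the `.env.template` (adjust them if you changed yours):

  ```python
  # Connect through the port-forward and print the server version.
  import psycopg2

  connection = psycopg2.connect(
      host="localhost",
      port=5432,
      dbname="lugvloei",
      user="postgres",
      password="postgres",
  )
  with connection.cursor() as cursor:
      cursor.execute("SELECT version();")
      print(cursor.fetchone()[0])
  connection.close()
  ```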
- Copy and paste the query in PostgreSQL-DDL into the query window, and run it to create two tables in the public schema and populate each of them with dummy data.
- Open the Airflow Webserver UI, hover over the Admin dropdown at the top of the UI, then click Connections.
- If you are using the default values from the `.env.template` for your `.env` values, just add the connection details below; otherwise, adjust the connection details to match your `.env` values.

  ```text
  Connection Id: pg_lugvloei
  Connection Type: Postgres
  # The format for the host is <svc>.<namespace>.svc.cluster.local
  # To get the svc name, you can run `kubectl get svc -n postgresql`
  # You actually got the details previously when you ran `make pg-install`
  Host: postgresql-db.postgresql.svc.cluster.local
  Database: lugvloei
  Login: postgres
  Password: postgres
  Port: 5432
  ```
- Click the Test 🚀 button. You should see a green light above the connection details with the Connection successfully tested text.
- Create a GCS bucket, then create a `GCS_DATA_LAKE_BUCKET` variable in Airflow, and use the name of the bucket you created as the value of the variable.
- Create a `google_cloud_default` connection in Airflow and fill in only the Project Id and Keyfile Path. Then create a `GCP_CONN_ID` variable and set `google_cloud_default` as its value. If you didn't change the default values related to the service account in provision.sh or airflow.yaml, you can use `/var/secrets/airflow-gcp-sa/serviceaccount.json` as the Keyfile Path value.
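
  Here is a hypothetical way to confirm that the variable and connection resolve correctly, for example from a throwaway test task or a Python session inside a worker pod (the names follow the steps above):

  ```python
  # Resolve the Airflow variable and connection, then list the bucket contents.
  from airflow.models import Variable
  from airflow.providers.google.cloud.hooks.gcs import GCSHook

  bucket = Variable.get("GCS_DATA_LAKE_BUCKET")
  gcs_hook = GCSHook(gcp_conn_id=Variable.get("GCP_CONN_ID"))

  # An empty list is fine here; an auth error means the connection is misconfigured.
  print(gcs_hook.list(bucket_name=bucket))
  ```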
- Create a new dataset in BigQuery and name it `lugvloei`. Choose the same region as the bucket you created previously. You can do this in the console, or from code as sketched below.
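
  A minimal sketch using the `google-cloud-bigquery` client (the location below is only an example; use your bucket's region):

  ```python
  # Create the lugvloei dataset with the service account key from the setup steps.
  from google.cloud import bigquery

  client = bigquery.Client.from_service_account_json("files/serviceaccount.json")

  dataset = bigquery.Dataset(f"{client.project}.lugvloei")
  dataset.location = "asia-southeast1"  # example region; match your GCS bucket

  client.create_dataset(dataset, exists_ok=True)
  print(f"Created dataset {client.project}.lugvloei")
  ```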
- Enable the `pg_lugvloei_orders` and `pg_lugvloei_users` DAGs in the Airflow Webserver UI.

  You can check the logs and other details by clicking the DAG name.
- Check the results in your GCS bucket and BigQuery dataset, for example with the quick query sketched below.
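
  A hypothetical spot check from code (the table names are assumptions based on the DAG names; adjust them to whatever tables the DAGs actually create):

  ```python
  # Count the rows the pipeline loaded into BigQuery.
  from google.cloud import bigquery

  client = bigquery.Client.from_service_account_json("files/serviceaccount.json")

  for table in ("users", "orders"):
      query = f"SELECT COUNT(*) AS row_count FROM `{client.project}.lugvloei.{table}`"
      for row in client.query(query).result():
          print(table, row.row_count)
  ```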
Akh, there you go!