This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Feature/airflow depoy #1

Merged
merged 4 commits into from Sep 26, 2023
1 change: 1 addition & 0 deletions .cfignore
@@ -0,0 +1 @@
logs/
34 changes: 34 additions & 0 deletions .devcontainer.json
@@ -0,0 +1,34 @@
{
"name": "Apache Airflow - sqlite",
"dockerComposeFile": [
"docker-compose.yaml"
],
"settings": {
"terminal.integrated.defaultProfile.linux": "bash"
},
"extensions": [
"ms-python.python",
"ms-python.vscode-pylance",
"mtxr.sqltools",
"mtxr.sqltools-driver-pg",
"rogalmic.bash-debug",
"ms-azuretools.vscode-docker",
"dbaeumer.vscode-eslint",
"ecmel.vscode-html-css",
"timonwong.shellcheck",
"redhat.vscode-yaml",
"rogalmic.bash-debug"
],
"service": "airflow",
"forwardPorts": [
8080,
5555,
5432,
6379
],
"workspaceFolder": "/opt/airflow",
// for users who use non-standard git config patterns
// https://github.com/microsoft/vscode-remote-release/issues/2084#issuecomment-989756268
"initializeCommand": "cd \"${localWorkspaceFolder}\" && git config --local user.email \"$(git config user.email)\" && git config --local user.name \"$(git config user.name)\"",
"overrideCommand": true
}
3 changes: 3 additions & 0 deletions .env
@@ -0,0 +1,3 @@
# AIRFLOW_UID=50000 ## this is UID within container, also reflected in docker-compose
AIRFLOW_UID=501
AIRFLOW_GID=0
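# Sketch (not part of this change): the official Airflow docker-compose guide derives this
# on Linux from the host user, e.g. `echo -e "AIRFLOW_UID=$(id -u)" > .env`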
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
logs/
**/__pycache__
55 changes: 55 additions & 0 deletions .profile
@@ -0,0 +1,55 @@
#!/bin/bash

##############################################################################
# NOTE: When adding commands to this file, be mindful of sensitive output.
# Since these logs are publicly available in github actions, we don't want
# to leak anything.
##############################################################################

set -o errexit
set -o pipefail

echo "airflow config setup..."

function vcap_get_service () {
local path name
name="$1"
path="$2"
#TODO FIX THIS
service_name=test-airflow-${name}
echo "$VCAP_SERVICES" | jq --raw-output --arg service_name "$service_name" ".[][] | select(.name == \$service_name) | $path"
}
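# e.g. `vcap_get_service redis .credentials.host` prints the host of the bound
# "test-airflow-redis" service instance.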

export APP_NAME=$(echo "$VCAP_APPLICATION" | jq -r '.application_name')

# # Create a staging area for secrets and files
# CONFIG_DIR=$(mktemp -d)
# SHARED_DIR=$(mktemp -d)

# Extract credentials from VCAP_SERVICES
export REDIS_HOST=$(vcap_get_service redis .credentials.host)
export REDIS_PASSWORD=$(vcap_get_service redis .credentials.password)
export REDIS_PORT=$(vcap_get_service redis .credentials.port)

export AIRFLOW__CELERY__BROKER_URL="$(vcap_get_service redis .credentials.uri)/0"
export BROKER_URL=$AIRFLOW__CELERY__BROKER_URL

AIRFLOW__CELERY__RESULT_BACKEND="db+$(vcap_get_service db .credentials.uri)"
export AIRFLOW__CELERY__RESULT_BACKEND=${AIRFLOW__CELERY__RESULT_BACKEND/'postgres'/'postgresql+psycopg2'} # rewrite the scheme so SQLAlchemy uses the psycopg2 driver
# export AIRFLOW__CELERY__RESULT_BACKEND=$AIRFLOW__CELERY__BROKER_URL

export FLOWER_PORT="$PORT"
# export SAML2_PRIVATE_KEY=$(vcap_get_service secrets .credentials.SAML2_PRIVATE_KEY)

# remote s3 for logs
export AIRFLOW__LOGGING__REMOTE_LOGGING="true"
export AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID="s3conn" # name of conn id in web ui?
# export AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=$(vcap_get_service s3 .credentials.uri)
export AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER="s3://$(vcap_get_service s3 .credentials.endpoint)/$(vcap_get_service s3 .credentials.bucket)/logs"
export AIRFLOW__LOGGING__ENCRYPT_S3_LOGS="false"

AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="$(vcap_get_service db .credentials.uri)"
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=${AIRFLOW__DATABASE__SQL_ALCHEMY_CONN/'postgres'/'postgresql+psycopg2'}

# TODO connections can be provided here:
# https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#storing-connections-in-environment-variables
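# Sketch (not enabled here): Airflow also picks up connections from AIRFLOW_CONN_<CONN_ID>
# environment variables, so the "s3conn" id used for remote logging above could be defined
# the same way. The credential key names below are assumptions about the bound s3 service:
# export AIRFLOW_CONN_S3CONN="aws://$(vcap_get_service s3 .credentials.access_key_id):$(vcap_get_service s3 .credentials.secret_access_key)@"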
87 changes: 87 additions & 0 deletions README.md
@@ -0,0 +1,87 @@
This Airflow ETL test was built with the following guides:

## Background

- https://towardsdatascience.com/run-airflow-docker-1b83a57616fb
- for airflow setup
- https://www.freecodecamp.org/news/orchestrate-an-etl-data-pipeline-with-apache-airflow/
- TODO: still need to wire in actual services
- https://davidgriffiths-data.medium.com/debugging-airflow-in-a-container-with-vs-code-7cc26734444

## Setup

1. Initialize Airflow (this runs the database migrations and creates the default user):

```
docker-compose up airflow-init
```

2. After that completes successfully, you can start the containers as normal:

```
docker-compose up
```

3. Access the Airflow UI at `localhost:8080` and log in with username/password `airflow`/`airflow`.
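
You can also confirm the webserver is up from the command line (a quick check, assuming the default port mapping):

```
curl http://localhost:8080/health
```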

## Debugging Remote Container

(using VS Code)

1. Install Dev Containers extension `ms-vscode-remote.remote-containers`
2. After you have installed the Dev Containers extension, open VS Code's Command Palette (Ctrl+Shift+P) and type: `Remote-Containers: Attach to a Running Container…`
3. Attach to `airflow-scheduler`
4. Open a terminal.
4.1 If you receive the following error, you need to modify your dev container settings:

```
The terminal process failed to launch: Path to shell executable "/sbin/nologin" does not exist.
```

4.2 Run `Remote-Containers: Open Container Configuration File` from the Command Palette after attaching.
4.3 Add `"remoteUser": "airflow"` to the JSON
4.4 Close the Container window and reattach
4.5 You should now be able to open a terminal
5. Select the correct Python interpreter by opening the Command Palette and choosing the global Python executable instead of the recommended one.
5.1 NOTE: This fixes the error you may encounter when running the debugger:

```
Exception has occurred: ModuleNotFoundError
No module named 'airflow'
File "/home/airflow/.local/bin/airflow", line 5, in <module>
from airflow.__main__ import main
ModuleNotFoundError: No module named 'airflow'
```

6. Create a launch.json. This configuration worked for me:

```
{
"version": "0.2.0",
"configurations": [
{
"name": "Airflow Test",
"type": "python",
"request": "launch",
"program": "/home/airflow/.local/bin/airflow",
"console": "integratedTerminal",
"args": [
"dags",
"test",
"etl_twitter_pipeline",
"2023-08-17"
]
}
]
}
```

6.1 Note that the last arg is the date (YYYY-MM-DD) of your log records in the `airflow/logs/scheduler` directory. This is only generated after running the `etl_twitter_pipeline` DAG in the UI and allowing it to dump some logs.
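
The same run can also be kicked off from a terminal inside the container, without the debugger (same DAG id and date as in the launch config above):

```
airflow dags test etl_twitter_pipeline 2023-08-17
```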

## TODO

- seed proper `devcontainer.json` file into container
- https://betterprogramming.pub/running-a-container-with-a-non-root-user-e35830d1f42a


## Harvest sources catalog API query example
- https://catalog.data.gov/api/action/package_search?fq=dataset_type:harvest
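
A quick way to list the matching dataset names from the command line (a sketch, assuming `jq` is installed; the response follows CKAN's `package_search` format):

```
curl -s "https://catalog.data.gov/api/action/package_search?fq=dataset_type:harvest" | jq '.result.results[].name'
```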