This is a data pipeline that uses Apache Airflow to rebuild the geocint mapaction pipeline.
The table below summarises the prerequisites of the Geocint Makefile recipes. These have been remade / mapped in the POC pipeline with the same names, and this is currently working (although only 3 steps have actually been implemented). A minimal sketch of how these Makefile-style dependencies translate into an Airflow DAG follows the table.
Stage name | Prerequisites | What it does / scripts it calls |
---|---|---|
Export country | NA - PHONY | Extracts polygon via osmium; OSM data import (import from ? to protocol buffer format); MapAction data table (upload to Postgres in a new table); MapAction export (creates .shp and .json files for a list of ? [countries? counties? other?]); MapAction upload CMF (uploads shp+tiff and geojson+tiff to S3, via ?cmf) |
All | Dev | |
Dev | upload_datasets_all, upload_cmf_all, create_completeness_report | slack_message.py |
upload_datasets_all | datasets_ckan_descriptions | mapaction_upload_dataset.sh - creates a folder and copies all .shp, .tif and .json files into it; zips it; creates a folder called `/data/out/country_extractions/<country_name>` in S3 and copies the zip into it |
upload_cmf_all | cmf_metadata_list_all | See MapAction upload CMF under Export country above |
datasets_ckan_descriptions | datasets_all | mapaction_build_dataset_description.sh |
datasets_all | ne_10m_lakes, ourairports, worldports, wfp_railroads, global_power_plant_database, ne_10m_rivers_lake_centerlines, ne_10m_populated_places, ne_10m_roads, healthsites, ocha_admin_boundaries, mapaction_export, worldpop1km, worldpop100m, elevation, download_hdx_admin_pop | |
ne_10m_lakes | | |
ourairports | | |
worldports | | |
wfp_railroads | | |
global_power_plant_database | | |
ne_10m_rivers_lake_centerlines | | |
ne_10m_populated_places | | |
ne_10m_roads | | |
healthsites | | |
ocha_admin_boundaries | | |
mapaction_export | | |
worldpop1km | | |
elevation | | |
download_hdx_admin_pop | | |
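As a rough illustration of how these Makefile-style stages and prerequisites translate into Airflow, the sketch below wires a subset of the stages from the table into a DAG. The task names and the scripts they call are taken from the table; the operator choice, script paths, DAG id and schedule are illustrative assumptions, not the actual POC code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A Makefile prerequisite ("dev: upload_datasets_all ...") becomes an upstream
# task in Airflow. Only a subset of the stages in the table is shown.
with DAG(
    dag_id="geocint_mapaction_poc",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                   # run on demand while the POC is built out
    catchup=False,
) as dag:
    datasets_all = BashOperator(
        task_id="datasets_all",
        bash_command="echo 'placeholder for the datasets_all stage'",
    )
    datasets_ckan_descriptions = BashOperator(
        task_id="datasets_ckan_descriptions",
        # Script named in the table; the trailing space stops Airflow trying to
        # render a string ending in .sh as a Jinja template file.
        bash_command="bash scripts/mapaction_build_dataset_description.sh ",
    )
    upload_datasets_all = BashOperator(
        task_id="upload_datasets_all",
        bash_command="bash scripts/mapaction_upload_dataset.sh ",
    )
    dev = BashOperator(
        task_id="dev",
        bash_command="python scripts/slack_message.py ",
    )

    # Prerequisites from the table, expressed as task dependencies.
    datasets_all >> datasets_ckan_descriptions >> upload_datasets_all >> dev
```

The only structural change from the Makefile is that each prerequisite becomes an upstream task, expressed with the `>>` operator.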
- Currently trying AWS's managed Airflow service (MWAA).
- See here for how to install non-Python dependencies (e.g. GDAL).
- If you get "DAG import errors" in the web UI, the full tracebacks are in the logs directory/volume (see the sketch after this list for a quick way to print them).
- To add new Python packages:
  - Stop all docker containers.
  - Remove containers and images with `docker container prune`, and delete the docker image for the airflow worker.
  - Update your requirements.txt file with `pip freeze > requirements.txt`.
  - Run `docker compose up`.
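For the "DAG import errors" case mentioned above, the tracebacks can also be printed directly instead of searching the logs volume. This is a convenience sketch, assuming it is run inside the scheduler or worker container where Airflow and the `/opt/airflow/dags` folder are available; the file name is hypothetical.

```python
# check_dag_imports.py - print DAG import errors without digging through logs.
# Run inside the container, e.g.:
#   docker exec -it <container_id> python check_dag_imports.py
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="/opt/airflow/dags", include_examples=False)

if dag_bag.import_errors:
    for dag_file, traceback_text in dag_bag.import_errors.items():
        print(f"--- {dag_file} ---")
        print(traceback_text)
else:
    print("No DAG import errors found.")
```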
- Install docker and docker compose.
- Follow the Airflow steps for using docker compose here.
- Note: you may need to force recreating images, e.g. `docker compose up --force-recreate`.
- The default location for the webserver is http://localhost:8080. The default username and password are both `airflow`.
- Create the required directories and a .env file containing AIRFLOW_UID and the WebDAV credentials used by the pipeline (a hedged sketch of how the WebDAV values might be consumed is at the end of this section):

  ```
  mkdir -p ./dags ./logs ./plugins ./config
  echo -e "AIRFLOW_UID=$(id -u)" > .env
  echo "WEBDAV_HOSTNAME=" >> .env
  echo "WEBDAV_LOGIN=" >> .env
  echo "WEBDAV_PASSWORD=" >> .env
  ```
- Clear all the docker containers using `docker system prune --all`. Note this will remove all containers from the system; if you just want to remove the airflow containers, you can use `docker container prune`.
- Run `docker compose up airflow-init`.
- Run `docker compose up`.
- Run `docker ps` to get the container ID of the airflow-worker container and copy it.
- Enter the container as root using `docker exec -it -u root <container_id> bash`.
- Create a group with the same ID as the AIRFLOW_UID using `sudo groupadd <AIRFLOW_UID>`.
- Give the default user a password with `sudo passwd default` (enter the new password when prompted).
- Add the default user to sudoers: `sudo usermod -aG sudo default`.
- Add the default user to the group: `sudo usermod -aG <AIRFLOW_UID> default`.
- Give the default user permission on the /opt/airflow directory using `sudo chown -R default:<AIRFLOW_UID> /opt/airflow/`.
- The default location for the webserver is http://localhost:8080. The default username and password are both `airflow`.
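The WEBDAV_* values written to .env earlier are intended to be read by the pipeline tasks at runtime. The sketch below is only an illustration of how a task might consume them; it assumes the webdavclient3 package, and the client library, remote path and file name are assumptions rather than what the POC actually uses.

```python
import os

# Assumes the webdavclient3 package ("pip install webdavclient3"); the POC may
# use a different WebDAV client entirely.
from webdav3.client import Client

options = {
    "webdav_hostname": os.environ["WEBDAV_HOSTNAME"],
    "webdav_login": os.environ["WEBDAV_LOGIN"],
    "webdav_password": os.environ["WEBDAV_PASSWORD"],
}
client = Client(options)

# Hypothetical upload of a zipped country extraction.
client.upload_sync(
    remote_path="country_extractions/example.zip",
    local_path="/tmp/example.zip",
)
```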