Merge pull request #132 from digital-land/feature/airflow-data-collection-pipelines

Changes for migration of data collection pipelines to Airflow
cpcundill authored Oct 30, 2024
2 parents 2e494cf + 4abe302 commit dca0794
Showing 16 changed files with 112 additions and 58 deletions.
17 changes: 17 additions & 0 deletions docs/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs.md
## Airflow & DAGs

### Airflow

The data collection pipelines are run in [Airflow](https://airflow.apache.org/) on an AWS managed service known as [MWAA](https://docs.aws.amazon.com/mwaa/latest/userguide/what-is-mwaa.html).

### Airflow UI

The Airflow UI is a web application with an HTML graphical user interface.

### URL for Airflow UI

A member of the Infrastructure team will be able to provide you with the environment-specific URLs for Airflow.

### DAGs

Airflow's jobs are defined using a Directed Acyclic Graph, or [DAG](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html) for short. The DAGs are generated via the [airflow-dags](https://github.com/digital-land/airflow-dags/) repo.
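
To make this concrete, the sketch below shows roughly the shape of a collection DAG. It is an illustration only, not the actual output of the airflow-dags generator: the dag_id, schedule and task command are hypothetical, and the `schedule` argument assumes Airflow 2.4 or later.

```
# Illustrative only - not the real generated DAG. The dag_id, schedule and
# command are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="article-4-direction-collection",  # hypothetical collection DAG name
    schedule="0 0 * * *",                     # hypothetical nightly schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # A single task standing in for the collect/build steps of the pipeline
    run_collection = BashOperator(
        task_id="run-collection",
        bash_command="make collect && make",
    )
```
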
1 change: 1 addition & 0 deletions docs/data-operations-manual/Explanation/index.md
This section explains our data operations and key concepts for understanding the platform:
- [Pipeline processes](./Key-Concepts/pipeline-processes)
- [Specification](./Key-Concepts/Specification)
- [Operational Procedures](Operational-Procedures/)
- [Airflow and DAGs](./Key-Concepts/Airflow-and-DAGs)
# Add a new dataset and collection

The following instructions are for adding an entirely new dataset to the platform which does not yet have a corresponding collection configuration.

If you're adding a new dataset to an existing collection, you can follow the same process but ignore step 2.

### 1. Ensure the collection is noted in the dataset specification

Skipping this step won't result in any specific errors, but it is strongly recommended to ensure that the list of collections in the specification is up to date.

Go to https://github.com/digital-land/specification/tree/main/content/dataset and check if there is a .md file for your new dataset.

E.g., if you plan to add a dataset named article-4-direction, you want to check that https://github.com/digital-land/specification/tree/main/content/dataset/article-4-direction.md exists. That's all!

If it doesn’t exist, raise this issue with the Data Design team and they should be able to set it up.

### 2. Create Collection and Pipeline config in the [config repo](https://github.com/digital-land/config)

The [config repo](https://github.com/digital-land/config) contains collection and pipeline folders for each unique collection held within the pipeline. In this case, there won't be one for the new collection.
You'll need to create new folders (`collection` and `pipeline`) and the required files within those folders for the new collection. There is a handy script for this step; simply run `create_collection.py` like so:
`python create_collection.py [DATASET NAME]`, e.g. `python create_collection.py article-4-direction`

Check that this has created the following files and that only the headers are in them:



### 3. Add endpoint(s) for the new dataset (optionally mapping and default values)

If a column-field mapping was provided, you'll need to add the mapping to column.csv. Make sure to map to the correct data: we want to map each source column to its equivalent `field`, e.g. the dataset might use a different name for the reference, so we need to map that column to our `reference` field.

If a defaultValue is given for a field in the schema, add it to default-value.csv.
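
For illustration only, a column.csv row mapping a hypothetical source column named REF to our `reference` field might look like the sketch below. The header row is an assumption made for this example, so copy the real headers from the existing column.csv files in the config repo; default-value.csv follows the same pattern with a field name and its default value.

```
dataset,endpoint,column,field
article-4-direction,,REF,reference
```
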
Follow steps 3 and 4 in ‘Adding an Endpoint’ for instructions on adding mappings.

Next, you can just proceed to add an endpoint. Validating the data through the endpoint checker will not work at this point, but we can manually check whether the endpoint_url and documentation_url work, as well as whether the required fields specified in the schema are present. There will be no further need to validate it. Once the PR is merged, you can go to the next step.
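
As a quick manual check that the two URLs respond, a short script along these lines can be used; the URLs below are placeholders, not real endpoints.

```
# Quick manual check that the endpoint and documentation URLs respond.
# The URLs are placeholders - substitute the ones you were given.
import requests

urls = {
    "endpoint_url": "https://example.org/article-4-direction.csv",
    "documentation_url": "https://example.org/article-4-direction",
}

for name, url in urls.items():
    response = requests.get(url, timeout=30)
    print(name, response.status_code)  # expect 200 for both
```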

### 4. Test locally

You can run a data collection pipeline locally and then verify your new configuration. You can achieve this in two ways.

#### 4.1 Run using config repo

The easiest option might be to just stay within the config repository.

Create a virtual environment with the following command:
```
python3 -m venv --prompt . .venv --clear --upgrade-deps
```

You then need to update the collection with the new dataset using the following commands:
```
make makerules   # fetch the shared makerules
make init        # install dependencies
make collect     # run the collector against the configured endpoints
make             # run the pipeline to build the dataset
```
Finally, you should be able to run `make datasette` and then see a local datasette version with the new dataset on it.

#### 4.2 Run using collection-task repo

Clone the [collection-task](https://github.com/digital-land/collection-task) repo.

Set the environment variables within the docker-compose.yaml file to values relevant to the collection and dataset you have added, then run with Docker Compose using `make compose-up`.

You should be able to run `make datasette` and then see a local datasette version with the new dataset on it.

#### Notes

> Tip: When running `make datasette`, if it returns that there is no SQL file, make sure that the corresponding collection in config has an endpoint and source, else the database for it will not be set up!
> You'll want to check the number of records created in the dataset; this should be equal to the number of entries in the raw dataset. Also check the issues table for anything that sticks out and is fixable on our end.
Once you're happy with the configuration for the new collection and new dataset, have it reviewed by a colleague and then merged.

### 5. Regenerate Airflow DAGs


Finally, you'll need to ensure the [DAGs for Airflow](https://github.com/digital-land/airflow-dags/) are re-published to AWS. To do this, simply follow the instructions below. Publishing the DAGs is necessary because the latest specification needs to be read for the relevant collection DAGs to be created.

#### Re-publish DAGs

1. If you have any DAG configuration changes to make within the airflow-dags repo, raise your changes as a pull request. Seek approval and have it merged.

1. When you're ready to publish DAGs to AWS, run the [GitHub Action to publish DAGs](https://github.com/digital-land/airflow-dags/actions/workflows/deploy.yml). After the GitHub Action has run, you'll be able to verify the collection is present via the [Airflow UI](/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs/#airflow-ui).

#### Verify new collection present

If the collection is present, you should be able to execute it and view details of the last execution, e.g.
![Airflow DAG last execution](/images/data-operations-manual/airflow-dag-last-execution.png)

You should be able to verify that the collection is included in the trigger-collection-dags-manual DAG, e.g.

![Airflow Trigger Collection DAGs - Manual](/images/data-operations-manual/airflow-trigger-collection-dags-manual.png)

If the collection is selected for schedule, then it should also be present within the trigger-collection-dags-scheduled DAG, e.g.

![Airflow Trigger Collection DAGs - Scheduled](/images/data-operations-manual/airflow-trigger-collection-dags-scheduled.png)

If the run was successful, there will be a green box next to the newly run workflow. If it was unsuccessful, check the logs to find out what the issue was.

> Tip: If, despite all this, the collection does not appear on datasette, try running the workflow again, or wait until the next cycle of digital-land-builder has run.

**Prerequisites:**

- Clone the [config repo](https://github.com/digital-land/config) by running `git clone [gitURL]` and update it with `make init` in your virtual environment
- Validate the data. If you haven't done this yet, follow the steps in ‘[Validating an Endpoint](../../Validating/Validate-an-endpoint)’ before continuing.

> **NOTE!**
> The `endpoint_checker` will pre-populate some of the commands mentioned in the steps below; check the end of the notebook underneath `_scripting_`.

Use git to push changes up to the repository; each night when the collection runs, the files are downloaded from here. It is a good idea to name the commit after the organisation you are importing.

1. **Run action workflow (optional)**
Optionally, if you don't want to wait until the next day, you can manually execute the workflow that usually runs overnight to check whether the data is actually on the platform. Simply follow the instructions in the [guide for triggering a collection manually](/data-operations-manual/How-To-Guides/Maintaining/Trigger-collection-manually).

## Endpoint edge-cases

10 changes: 10 additions & 0 deletions docs/data-operations-manual/How-To-Guides/Maintaining/Trigger-collection-manually.md
# Trigger collection manually

Data collection pipelines are executed via [Airflow](/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs).

Collections are usually run on a scheduled basis by having them selected in the [scheduling configuration](https://github.com/digital-land/airflow-dags/blob/main/bin/generate_dag_config.py#L13) of the airflow-dags repo. However, if you want to execute a collection run manually, you can do so by
navigating to the [Airflow UI](/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs/#airflow-ui), finding the relevant collection DAG and clicking the corresponding play button (blue triangle) to trigger an execution, e.g.

![Airflow DAG play button execution](/images/data-operations-manual/airflow-dag-play-button.png)

Depending on the collection, this can take a while, but after it has finished running you can check on datasette whether the expected data is on the platform.
2 changes: 1 addition & 1 deletion docs/development/key-principles.md
# Key Principles

This page sets out our key principles for developers working on code throughout the digital land project.

12 changes: 6 additions & 6 deletions docs/development/monitoring.md
## Monitoring

Across our infrastructure we host multiple applications and data pipelines, all under constant development. Naturally, this system requires multiple methods of monitoring to keep track of what's going on.

We're still developing and improving our approach to monitoring, so please reach out with any new ideas or improvements!

### Slack notifications

The most useful tool at our disposal is the delivery of key notifications in our Slack notifications channel. If you are not part of this channel, reach out to the tech lead to get access. There are several key types of notifications:

* Sentry Alerts - We have integrated Sentry into our running applications. When a new issue is raised in Sentry, a notification is posted in the channel. The infrastructure team will monitor these alerts, triage any issues and possibly pass them on to the relevant team for resolution.
* Deployment Notifications - These are posted by AWS when a new image for one of our applications is created and published to one of our Elastic Container Registries (ECR). They show the progress as a new container is deployed via blue-green deployment. Make sure to review these when you deploy changes to one of our environments.
* GitHub Action (GHA) Failures - We still run a lot of processing in GitHub Actions across multiple repositories. When one of these fails, the details are posted in the channel with a link to the action. This only covers data-processing actions at the moment.
* Security Scans - We have security scans set up on our main application. These do both static and dynamic audits of code each week and the reports are posted. We're hoping to apply these scans to multiple repos in the future.

### Sentry
There may be scope to log performance and metrics via Sentry in the future too.

### CloudWatch Dashboards

We have several dashboards that can give some metrics based on the logs in our infrastructure. We can give permissions to these dashboards for those that need it.



4 changes: 2 additions & 2 deletions docs/development/testing-guidance.md
Throughout our codebase there are a number of different types of testing that we use.

### Unit

Unit tests should always test the smallest piece of code that can be logically isolated in a system. This means that we can ensure the smallest piece of code meets its requirements. A unit test should mock its dependencies and shouldn't rely on a file system or a database to run. Altering how code is written can help remove these dependencies or make mocking easier. Larger functions/methods which combine a lot of units of code may not be appropriate to test with a unit test. In these cases, integration tests should be used.
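
As a sketch of what this looks like in practice, the example below uses Python's unittest.mock; the function, test and names are hypothetical, not taken from our codebase.

```
# Hypothetical example: a small unit of code with its dependency mocked out,
# so the test needs no file system, database or network access.
from unittest.mock import Mock


def count_active_sources(fetch_sources):
    """Count sources flagged as active; fetch_sources is an injected dependency."""
    return sum(1 for source in fetch_sources() if source.get("active"))


def test_count_active_sources():
    fake_fetch = Mock(return_value=[{"active": True}, {"active": False}, {"active": True}])
    assert count_active_sources(fake_fetch) == 2
    fake_fetch.assert_called_once()
```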

### Integration

### Acceptance

Acceptance tests should reflect acceptance criteria we set before picking up pieces of work.

### Performance

Performance tests allow us to focus on optimising a particular part of a process. They will not be run as part of every PR as they are not based on acceptance criteria, but they should be run semi-regularly to help us ensure that our code isn't becoming bloated and slow over time.
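
For illustration, a lightweight performance test might time a hot code path with the standard library's timeit; the `transform_rows` function below is a hypothetical stand-in, not real project code, and the threshold is arbitrary.

```
# Illustrative performance check using the standard library.
# transform_rows is a stand-in for whatever hot path we want to keep fast.
import timeit


def transform_rows(rows):
    return [{**row, "reference": row["reference"].strip().lower()} for row in rows]


def test_transform_rows_is_fast_enough():
    rows = [{"reference": f"  REF-{i} "} for i in range(10_000)]
    seconds = timeit.timeit(lambda: transform_rows(rows), number=10)
    # Generous threshold so the test stays stable across machines; tune as needed.
    assert seconds < 2.0
```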

## Testing Structure
