Merge pull request #132 from digital-land/feature/airflow-data-collection-pipelines

Changes for migration of data collection pipelines to Airflow
cpcundill authored Oct 30, 2024
2 parents 2e494cf + 4abe302 commit dca0794
Showing 16 changed files with 112 additions and 58 deletions.
17 changes: 17 additions & 0 deletions docs/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs.md
## Airflow & DAGs

### Airflow

The data collection pipelines are run in [Airflow](https://airflow.apache.org/) on an AWS managed service known as [MWAA](https://docs.aws.amazon.com/mwaa/latest/userguide/what-is-mwaa.html).

### Airflow UI

The Airflow UI is a web application with an HTML graphical user interface.

### URL for Airflow UI

A member of the Infrastructure team will be able to provide you with the environment-specific URLs for Airflow.

### DAGs

Airflow's jobs are defined using a Directed Acyclic Graph, or [DAG](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html) for short. The DAGs are generated via the [airflow-dags](https://github.com/digital-land/airflow-dags/) repo.
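
To make this concrete, the sketch below shows roughly the shape of a collection DAG. It is an illustration only, not the actual output of the airflow-dags generator: the dag_id, schedule and task command are hypothetical, and the `schedule` argument assumes Airflow 2.4 or later.

```
# Illustrative only - not the real generated DAG. The dag_id, schedule and
# command are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="article-4-direction-collection",  # hypothetical collection DAG name
    schedule="0 0 * * *",                     # hypothetical nightly schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # A single task standing in for the collect/build steps of the pipeline
    run_collection = BashOperator(
        task_id="run-collection",
        bash_command="make collect && make",
    )
```
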
1 change: 1 addition & 0 deletions docs/data-operations-manual/Explanation/index.md
This section explains our data operations and key concepts for understanding the platform:
- [Pipeline processes](./Key-Concepts/pipeline-processes)
- [Specification](./Key-Concepts/Specification)
- [Operational Procedures](Operational-Procedures/)
- [Airflow and DAGs](./Key-Concepts/Airflow-and-DAGs)
# Add a new dataset and collection

The following instructions are for adding an entirely new dataset to the platform which does not yet have a corresponding collection configuration.

If you're adding a new dataset to an existing collection, you can follow the same process but ignore step 2.

### 1. Ensure the collection is noted in the dataset specification

Skipping this step won't result in any specific errors, but it is strongly recommended to ensure that the list of collections in the specification is up to date.

Go to https://github.com/digital-land/specification/tree/main/content/dataset and check if there is a .md file for your new dataset.

E.g., if you plan to add a dataset named article-4-direction, you want to check that https://github.com/digital-land/specification/tree/main/content/dataset/article-4-direction.md exists. That's all!

If it doesn’t exist, raise this issue with the Data Design team and they should be able to set it up.

### 2. Create Collection and Pipeline config in the [config repo](https://github.com/digital-land/config)

The [config repo](https://github.com/digital-land/config) contains collection and pipeline folders for each unique collection held within the pipeline. In this case, there won't be one for the new collection.
You'll need to create new folders (`collection` and `pipeline`) and the required files within those folders for the new collection. There is a handy script for this step; simply run `create_collection.py` like so:
`python create_collection.py [DATASET NAME]`, e.g. `python create_collection.py article-4-direction`

Check that this has created the following files and that only the headers are in them:



### 3. Add endpoint(s) for the new dataset (optionally mapping and default values)

If a column-field mapping was provided, you'll need to add the mapping to column.csv. Make sure to map to the correct data: we want to map each source column to its equivalent `field`, e.g. the dataset might use a different name for the reference, so we need to map that column to our `reference` field.

If a defaultValue is given for a field in the schema, add it to default-value.csv.
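
For illustration only, a column.csv row mapping a hypothetical source column named REF to our `reference` field might look like the sketch below. The header row is an assumption made for this example, so copy the real headers from the existing column.csv files in the config repo; default-value.csv follows the same pattern with a field name and its default value.

```
dataset,endpoint,column,field
article-4-direction,,REF,reference
```
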
Follow steps 3 and 4 in ‘Adding an Endpoint’ for instructions on adding mappings.

Next, you can just proceed to add an endpoint. Validating the data through the endpoint checker will not work at this point, but we can manually check whether the endpoint_url and documentation_url work, as well as whether the required fields specified in the schema are present. There will be no further need to validate it. Once the PR is merged, you can go to the next step.
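
As a quick manual check that the two URLs respond, a short script along these lines can be used; the URLs below are placeholders, not real endpoints.

```
# Quick manual check that the endpoint and documentation URLs respond.
# The URLs are placeholders - substitute the ones you were given.
import requests

urls = {
    "endpoint_url": "https://example.org/article-4-direction.csv",
    "documentation_url": "https://example.org/article-4-direction",
}

for name, url in urls.items():
    response = requests.get(url, timeout=30)
    print(name, response.status_code)  # expect 200 for both
```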

### 4. Test locally

You can run a data collection pipeline locally and then verify your new configuration. You can achieve this in two ways.

#### 4.1 Run using config repo

The easiest option might be to just stay within the config repository.

Create a virtual environment with the following command:
```
python3 -m venv --prompt . .venv --clear --upgrade-deps
```

You then need to update the collection with the new dataset using the following commands:
```
make makerules   # fetch the shared makerules
make init        # install dependencies
make collect     # run the collector against the configured endpoints
make             # run the pipeline to build the dataset
```
Finally, you should be able to run `make datasette` and then see a local datasette version with the new dataset on it.

#### 4.2 Run using collection-task repo

Clone the [collection-task](https://github.com/digital-land/collection-task) repo.

Set the environment variables within the docker-compose.yaml file to values relevant to the collection and dataset you have added, then run with Docker Compose using `make compose-up`.

You should be able to run `make datasette` and then see a local datasette version with the new dataset on it.

#### Notes

> Tip: When running `make datasette`, if it returns that there is no SQL file, make sure that the corresponding collection in config has an endpoint and source, else the database for it will not be set up!
> You'll want to check the number of records created in the dataset; this should be equal to the number of entries in the raw dataset. Also check the issues table for anything that sticks out and is fixable on our end.
Once you're happy with the configuration for the new collection and new dataset, have it reviewed by a colleague and then merged.

### 5. Regenerate Airflow DAGs


Finally, you'll need to ensure the [DAGs for Airflow](https://github.com/digital-land/airflow-dags/) are re-published to AWS. To do this, simply follow the instructions below. Publishing the DAGs is necessary because the latest specification needs to be read for the relevant collection DAGs to be created.

#### Re-publish DAGs

1. If you have any DAG configuration changes to make within the airflow-dags repo, raise your changes as a pull request. Seek approval and have it merged.

1. When you're ready to publish DAGs to AWS, run the [GitHub Action to publish DAGs](https://github.com/digital-land/airflow-dags/actions/workflows/deploy.yml). After the GitHub Action has run, you'll be able to verify the collection is present via the [Airflow UI](/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs/#airflow-ui).

#### Verify new collection present

If the collection is present, you should be able to execute it and view details of the last execution, e.g.
![Airflow DAG last execution](/images/data-operations-manual/airflow-dag-last-execution.png)

You should be able to verify that the collection is included in the trigger-collection-dags-manual DAG, e.g.

![Airflow Trigger Collection DAGs - Manual](/images/data-operations-manual/airflow-trigger-collection-dags-manual.png)

If the collection is selected for schedule, then it should also be present within the trigger-collection-dags-scheduled DAG, e.g.

![Airflow Trigger Collection DAGs - Scheduled](/images/data-operations-manual/airflow-trigger-collection-dags-scheduled.png)

If the run was successful, there will be a green box next to the newly run workflow. If it was unsuccessful, check the logs to find out what the issue was.

> Tip: If, despite all this, the collection does not appear on datasette, try running the workflow again, or wait until the next cycle of digital-land-builder has run.

**Prerequisites:**

- Clone the [config repo](https://github.com/digital-land/config) by running `git clone [gitURL]` and update it with `make init` in your virtual environment
- Validate the data. If you haven't done this yet, follow the steps in ‘[Validating an Endpoint](../../Validating/Validate-an-endpoint)’ before continuing.

> **NOTE!**
> The `endpoint_checker` will pre-populate some of the commands mentioned in the steps below; check the end of the notebook underneath `_scripting_`.

Use git to push changes up to the repository; each night when the collection runs, the files are downloaded from here. It is a good idea to name the commit after the organisation you are importing.

1. **Run action workflow (optional)**
Optionally, if you don't want to wait until the next day, you can manually execute the workflow that usually runs overnight to check whether the data is actually on the platform. Simply follow the instructions in the [guide for triggering a collection manually](/data-operations-manual/How-To-Guides/Maintaining/Trigger-collection-manually).

## Endpoint edge-cases

10 changes: 10 additions & 0 deletions docs/data-operations-manual/How-To-Guides/Maintaining/Trigger-collection-manually.md
# Trigger collection manually

Data collection pipelines are executed via [Airflow](/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs).

Collections are usually run on a scheduled basis by having them selected in the [scheduling configuration](https://github.com/digital-land/airflow-dags/blob/main/bin/generate_dag_config.py#L13) of the airflow-dags repo. However, if you want to execute a collection run manually, you can do so by
navigating to the [Airflow UI](/data-operations-manual/Explanation/Key-Concepts/Airflow-and-DAGs/#airflow-ui), finding the relevant collection DAG and clicking the corresponding play button (blue triangle) to trigger an execution, e.g.

![Airflow DAG play button execution](/images/data-operations-manual/airflow-dag-play-button.png)

Depending on the collection, this can take a while, but after it has finished running you can check on datasette whether the expected data is on the platform.
2 changes: 1 addition & 1 deletion docs/development/key-principles.md
# Key Principles

This page sets out our key principles for developers working on code throughout the digital land project.

12 changes: 6 additions & 6 deletions docs/development/monitoring.md
## Monitoring

Across our infrastructure we host multiple applications and data pipelines, all under constant development. Naturally, this system requires multiple methods of monitoring to keep track of what's going on.

We're still developing and improving our approach to monitoring, so please reach out with any new ideas or improvements!

### Slack notifications

The most useful tool at our disposal is the delivery of key notifications in our Slack notifications channel. If you are not part of this channel, reach out to the tech lead to get access. There are several key types of notifications:

* Sentry Alerts - We have integrated Sentry into our running applications. When a new issue is raised in Sentry, a notification is posted in the channel. The infrastructure team will monitor these alerts, triage any issues and possibly pass them on to the relevant team for resolution.
* Deployment Notifications - These are posted by AWS when a new image for one of our applications is created and published to one of our Elastic Container Registries (ECR). They show the progress as a new container is deployed via blue-green deployment. Make sure to review these when you deploy changes to one of our environments.
* GitHub Action (GHA) Failures - We still run a lot of processing in GitHub Actions across multiple repositories. When one of these fails, the details are posted in the channel with a link to the action. This only covers data-processing actions at the moment.
* Security Scans - We have security scans set up on our main application. These do both static and dynamic audits of code each week and the reports are posted. We're hoping to apply these scans to multiple repos in the future.

### Sentry
There may be scope to log performance and metrics via Sentry in the future too.

### CloudWatch Dashboards

We have several dashboards that can give some metrics based on the logs in our infrastructure. We can give permissions to these dashboards for those that need it.



4 changes: 2 additions & 2 deletions docs/development/testing-guidance.md
Throughout our codebase there are a number of different types of testing that we use.

### Unit

Unit tests should always test the smallest piece of code that can be logically isolated in a system. This means that we can ensure the smallest piece of code meets its requirements. A unit test should mock its dependencies and shouldn't rely on a file system or a database to run. Altering how code is written can help remove these dependencies or make mocking easier. Larger functions/methods which combine a lot of units of code may not be appropriate to test with a unit test. In these cases, integration tests should be used.
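
As a sketch of what this looks like in practice, the example below uses Python's unittest.mock; the function, test and names are hypothetical, not taken from our codebase.

```
# Hypothetical example: a small unit of code with its dependency mocked out,
# so the test needs no file system, database or network access.
from unittest.mock import Mock


def count_active_sources(fetch_sources):
    """Count sources flagged as active; fetch_sources is an injected dependency."""
    return sum(1 for source in fetch_sources() if source.get("active"))


def test_count_active_sources():
    fake_fetch = Mock(return_value=[{"active": True}, {"active": False}, {"active": True}])
    assert count_active_sources(fake_fetch) == 2
    fake_fetch.assert_called_once()
```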

### Integration

### Acceptance

Acceptance tests should reflect acceptance criteria we set before picking up pieces of work.

### Performance

Performance tests allow us to focus on optimising a particular part of a process. They will not be run as part of every PR as they are not based on acceptance criteria, but they should be run semi-regularly to help us ensure that our code isn't becoming bloated and slow over time.
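
For illustration, a lightweight performance test might time a hot code path with the standard library's timeit; the `transform_rows` function below is a hypothetical stand-in, not real project code, and the threshold is arbitrary.

```
# Illustrative performance check using the standard library.
# transform_rows is a stand-in for whatever hot path we want to keep fast.
import timeit


def transform_rows(rows):
    return [{**row, "reference": row["reference"].strip().lower()} for row in rows]


def test_transform_rows_is_fast_enough():
    rows = [{"reference": f"  REF-{i} "} for i in range(10_000)]
    seconds = timeit.timeit(lambda: transform_rows(rows), number=10)
    # Generous threshold so the test stays stable across machines; tune as needed.
    assert seconds < 2.0
```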

## Testing Structure
