SPIKE: Create harvesting workflow using apache airflow. #4422

Closed
rshewitt opened this issue Aug 14, 2023 · 28 comments

rshewitt commented Aug 14, 2023

User Story

  • In order to determine the viability of Apache Airflow as a data harvesting workflow solution, data.gov will stand up an instance of Apache Airflow and run a harvest source through the harvest job lifecycle shown in the Harvest Job Lifecycle diagram.

Acceptance Criteria

  • GIVEN data.gov wants to explore existing workflow solutions for a new harvesting pipeline and has 4 days' time, THEN Apache Airflow and the harvesting repo will be used to determine the solution's viability and its deployability to cloud.gov, along with a discussion of findings.

Background

Sketch

Do the following in order, as time allows given this is a spike:

  • Spin up Apache Airflow locally using Docker
  • Create a DAG that extracts and validates a DCAT-US catalog (use test examples from the harvesting-logic repo; a minimal sketch follows this list)
  • Validate that the above process works
  • Validate that the UI works for exploring jobs
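A minimal sketch of what that extract-and-validate DAG might look like using the TaskFlow API; the dag name, URL, and validation body here are placeholder assumptions, and real schema checks would come from the harvesting-logic repo:

```python
from datetime import datetime

import requests
from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2023, 8, 14), catchup=False)
def dcatus_extract_validate():

    @task
    def extract(url: str) -> dict:
        # download the DCAT-US catalog
        return requests.get(url, timeout=60).json()

    @task
    def validate(catalog: dict) -> int:
        # placeholder: real DCAT-US schema validation would live in harvesting-logic
        assert "dataset" in catalog
        return len(catalog["dataset"])

    validate(extract("https://example.gov/data.json"))


# assigning the invocation is what registers the DAG (see the comments below)
_ = dcatus_extract_validate()
```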
@rshewitt rshewitt converted this from a draft issue Aug 14, 2023
@rshewitt rshewitt added the H2.0/Harvest-General General Harvesting 2.0 Issues label Aug 14, 2023

btylerburton commented Aug 15, 2023

Some assumptions to test...

  • Queue
    • Can we spin up multiple workers?
    • Parallelize harvest processing
  • Processing
    • One to many for workers
  • UI
    • Managed access

@btylerburton btylerburton self-assigned this Aug 15, 2023
@btylerburton btylerburton moved this from New Dev to 🏗 In Progress [8] in data.gov team board Aug 15, 2023
@btylerburton

I'm going to put my name on this as I'll be devoting some time to it, but that shouldn't preclude others from working on it as well, either collaboratively or in private.

rshewitt commented Aug 18, 2023

The Airflow scheduler (Airflow version 2.3.0) won't detect a DAG defined with the TaskFlow API approach (i.e. decorating with task instead of using PythonOperator; see the comment below for an example) unless you assign the DAG function invocation to something. For example, this works...

_ = dag_function()

This doesn't work...

dag_function()

When using the latter approach, the scheduler log reports ...WARNING - No viable dags retrieved from [location of dag script.py]. This seems to conflict with an example found in the TaskFlow documentation.

rshewitt commented Aug 18, 2023

Using the TaskFlow API approach means less code to write. For example,

from airflow.decorators import task

# dag creation has already occurred
@task(task_id="transform")
def transform():
    ...  # do something

transform()  # add the task to the dag workflow

compares to

from airflow.operators.python import PythonOperator

# dag creation has already occurred
def transform():
    ...  # do something

task1 = PythonOperator(
    task_id="transform",
    python_callable=transform,
    dag=dag,  # attach the task to the dag; the operator itself is not called
)

Long story short, it removes the explicit need for the PythonOperator (in this instance). The same may apply to any conventional Airflow operator (e.g. email, python, bash).

@rshewitt rshewitt self-assigned this Aug 18, 2023
@rshewitt

I published my branch airflow-etl-test-reid; it's intended as a WIP. I paired with @jbrown-xentity earlier and went over my current findings. We discussed the possibility of tasks needing to share data between themselves (e.g. extract to transform). Based on what I've read, tasks are meant to execute in isolation, but Airflow offers a cross-task communication mechanism called XComs which allows for this. Testing how much data can be passed before Airflow reacts badly could be valuable.

rshewitt commented Aug 19, 2023

If sharing data between tasks proves unacceptable, one alternative could be having each task pull data from S3, process it, then load it back to S3. I'm unsure if this is better.
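A rough sketch of that alternative using the Amazon provider's S3Hook; the connection id, bucket name, and validation step are placeholder assumptions:

```python
from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@task
def validate_from_s3(key: str) -> str:
    s3 = S3Hook(aws_conn_id="aws_default")
    # pull the extracted record from S3 instead of receiving it via XCom
    record = s3.read_key(key=key, bucket_name="harvest-staging")
    validated = record  # placeholder for real validation
    # write the result back to S3; only the key travels between tasks
    out_key = f"validated/{key}"
    s3.load_string(validated, key=out_key, bucket_name="harvest-staging", replace=True)
    return out_key
```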

@rshewitt

It looks like tasks push to XCom by default if they return a value that is not None.
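A sketch of that implicit push/pull with TaskFlow, plus the per-record fan-out via dynamic task mapping; the dag name, URL, and validation check are placeholder assumptions:

```python
from datetime import datetime

import requests
from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2023, 8, 18), catchup=False)
def xcom_fanout_example():

    @task
    def extract(url: str) -> list:
        # the return value is not None, so it is pushed to XCom automatically
        return requests.get(url, timeout=60).json().get("dataset", [])

    @task
    def validate(record: dict) -> bool:
        # placeholder check; real DCAT-US validation would go here
        return "title" in record

    # validate pulls from XCom implicitly; .expand() fans out one mapped
    # validate task instance per extracted record (dynamic task mapping)
    validate.expand(record=extract("https://example.gov/data.json"))


_ = xcom_fanout_example()
```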

rshewitt commented Aug 21, 2023

Some run information from extracting and validating a large spatial dataset; note the Max Run Duration. This could be a miscalculation based on limited information from the task; it's probably something to do with the first run start and the last run start.

Screenshot 2023-08-21 at 12 29 04 PM

@jbrown-xentity

To summarize: we have a working version of a DAG (a harvest) that extracts a full DCAT-US catalog and distributes the individual datasets into a new workflow where they get validated independently.
There are 4 things we would like to investigate further:

@rshewitt

Test case where 2 of the dcatus records in the catalog are invalid.

Screenshot 2023-08-22 at 12 43 19 PM

Followed by only 1 load (the valid record).

Screenshot 2023-08-22 at 12 43 29 PM

@rshewitt

Test case where the extraction failed, causing all downstream tasks (validate & load) to fail.

Screenshot 2023-08-22 at 12 48 40 PM

@rshewitt

The picture above indicates that validate and load were kicked off, which we may want to avoid entirely. However, the validate task duration below shows that, despite being kicked off, no time was spent on it.

Screenshot 2023-08-22 at 12 58 42 PM

@rshewitt

I'm going to use a subset of the current catalog harvest sources, pulled with this query, as the input for dynamically generating DAGs.
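A sketch of how those DAGs could be generated dynamically; HARVEST_SOURCES is a hypothetical stand-in for the query results, and the task bodies are placeholders. The key point is that each generated DAG must end up in the module's globals():

```python
from datetime import datetime

from airflow.decorators import dag, task

# hypothetical stand-in for the harvest sources returned by the query
HARVEST_SOURCES = [
    {"name": "commerce_non_spatial_data.json", "url": "https://example.gov/commerce/data.json"},
    {"name": "bls_data", "url": "https://example.gov/bls/data.json"},
]


def build_harvest_dag(source: dict):
    @dag(
        dag_id=f"{source['name']}_harvest_source_workflow",
        schedule_interval=None,
        start_date=datetime(2023, 8, 26),
        catchup=False,
    )
    def harvest():
        @task
        def extract() -> dict:
            ...  # download source["url"] and return the catalog

        @task
        def validate(catalog: dict):
            ...  # validate against the DCAT-US schema

        validate(extract())

    return harvest()


for src in HARVEST_SOURCES:
    # assigning into globals() is what lets the scheduler pick up each DAG
    globals()[f"{src['name']}_harvest_source_workflow"] = build_harvest_dag(src)
```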

@jbrown-xentity

For the cloud.gov installation, we would use pip installation on the Python runner, probably with the Redis extras: https://airflow.apache.org/docs/apache-airflow/1.10.10/installation.html#extra-packages

@robert-bryson

Breaking the deploy-to-cloud.gov work off into #4434.

rshewitt commented Aug 26, 2023

In response to my previous comment: I've since found that DAGs are identified by whether they exist in the global namespace of the file (i.e. whether they are in globals()) when the file is processed. Function invocation or class instantiation without storing the result in a variable does not add to the file's global namespace; since globals() returns a dictionary, not using a variable means no key gets assigned to the value.

@rshewitt

The size limit for an XCom using a Postgres backend looks to be 1 GB. I've seen this value referenced in other articles as well.

@rshewitt

Docker crashed after attempting to process 16 instances of a dataset on my local machine. It appeared to be a memory issue.

@btylerburton

Let's do some real load testing on cloud.gov after @robert-bryson's work is stood up.

@rshewitt

Something to note on callback execution of tasks. This is in reference to my work on error handling.

rshewitt commented Aug 29, 2023

Callback functions are provided with the associated task context. Here's a list of the variables within a context.
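A sketch of an on_failure_callback that reads a few of those context variables; the callback name and log handling are placeholder assumptions, and the task would be wired into a DAG as usual:

```python
import logging

from airflow.decorators import task

log = logging.getLogger(__name__)


def notify_failure(context: dict):
    # the context dict carries task-run metadata such as ti, dag, run_id, exception
    ti = context["ti"]
    log.error(
        "Task %s in DAG %s failed on try %s: %s",
        ti.task_id,
        ti.dag_id,
        ti.try_number,
        context.get("exception"),
    )


# the callback fires with the task's context when the task fails
@task(task_id="extract", on_failure_callback=notify_failure)
def extract():
    raise ValueError("simulated extraction failure")
```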

rshewitt commented Aug 29, 2023

What techniques do we have to control the flow of information when something goes wrong?

rshewitt commented Aug 30, 2023

Using the TaskFlow API approach requires calling the tasks in the workflow in order for them to work properly.

# TaskFlow API approach
from datetime import datetime

from airflow import DAG
from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator


@dag(schedule_interval=None, start_date=datetime(2023, 8, 30), catchup=False)
def test_pipeline():

    @task(task_id="task1")
    def task1():
        return "example"

    @task(task_id="task2")
    def task2():
        return "another example"

    task1() >> task2()  # calling the functions

_ = test_pipeline()

# Traditional approach
# using a context manager because decorating would be using TaskFlow; a context
# manager works fine with TaskFlow-decorated tasks as well
with DAG(
    dag_id="test_pipeline_traditional",
    schedule_interval=None,
    start_date=datetime(2023, 8, 30),
    catchup=False,
) as dag:

    task1 = EmptyOperator(task_id="task1")
    task2 = EmptyOperator(task_id="task2")

    task1 >> task2  # not calling the operators

# no final invocation is needed here: the context manager already bound the DAG
# object to the module-level name dag, which is how the scheduler finds it

Issue comment for reference on the TaskFlow approach.

Basically, if you're using operators then you don't need to invoke them; if you're using functions decorated with task then you do need to invoke them.

rshewitt commented Sep 1, 2023

Rule of thumb: anytime branching is implemented, ALWAYS consider the trigger_rule of downstream tasks. By default tasks have a trigger_rule of "all_success", meaning all immediate parent tasks must succeed. In the event of a branch, the successful branch is followed and the failure branch is not, causing all tasks in the failure branch to be skipped. Skipped is a state of a task, and skipped != success. A common pattern is to have branches eventually join back together (e.g. into a bulk load); that load would be skipped if a branch converging into it was skipped as a result of the branching.
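A sketch of that joining pattern, assuming TaskFlow branching and Empty placeholder tasks (all names here are assumptions); without the non-default trigger_rule the join would be skipped along with the untaken branch:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule


@dag(schedule_interval=None, start_date=datetime(2023, 9, 1), catchup=False)
def branch_join_example():

    @task.branch
    def choose(records_are_valid: bool) -> str:
        # return the task_id of the branch to follow
        return "load" if records_are_valid else "skip_load"

    load = EmptyOperator(task_id="load")
    skip_load = EmptyOperator(task_id="skip_load")

    # "all_success" (the default) would skip this join whenever one branch is
    # skipped; "none_failed_min_one_success" lets it run after the taken branch
    join = EmptyOperator(
        task_id="join",
        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS,
    )

    choose(True) >> [load, skip_load] >> join


_ = branch_join_example()
```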

rshewitt commented Sep 1, 2023

We could potentially store our validation schemas as Variables in Airflow.
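For example, a schema stored as an Airflow Variable could be pulled inside a task; the variable name and the check below are placeholder assumptions:

```python
from airflow.decorators import task
from airflow.models import Variable


@task
def validate(record: dict) -> bool:
    # Variables live in the metadata DB; deserialize_json parses the stored JSON
    schema = Variable.get("dcatus_dataset_schema", deserialize_json=True)
    # placeholder: a real implementation would run jsonschema validation here
    return set(schema.get("required", [])) <= set(record)
```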

rshewitt commented Sep 1, 2023

                     -> task C (true)    \
                    /
 task A -> taskBranch                            -> task F
                    \  ->  task E(Dummy)(false) /

^ this works.

Apparently branching requires at least 2 downstream tasks: one for true and one for false. It seems taskBranch can't create a direct dependency to task F if a false task isn't provided, so a dummy operator needs to be used.

                     -> task C (true)  \
                   /
 task A -> taskBranch   -----            -> task F

^ this doesn't work.

rshewitt commented Sep 6, 2023

Error extracting commerce_non_spatial_data.json_harvest_source_workflow in Airflow. The extracted catalog returns more records than dynamic task mapping accepts by default, so the XCom push fails (airflow.exceptions.UnmappableXComLengthPushed: unmappable return value length: 1609 > 1024, i.e. 1609 records against the default max_map_length of 1024). source

rshewitt commented Sep 6, 2023

Error extracting bls_data_workflow. The Airflow log indicates a JSON decode error. Attempting to download the JSON file manually returns a 403 HTTP status code.

@btylerburton btylerburton removed their assignment Sep 11, 2023
@rshewitt rshewitt moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Sep 13, 2023
@rshewitt rshewitt moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Sep 13, 2023
@hkdctol hkdctol closed this as completed Sep 14, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Sep 14, 2023
@btylerburton btylerburton added H2.0/orchestrator and removed H2.0/Harvest-General General Harvesting 2.0 Issues labels Dec 13, 2023