SPIKE: [2d] Explore celery as POC airflow executor #4461

Closed
rshewitt opened this issue Sep 12, 2023 · 6 comments

@rshewitt
Contributor

rshewitt commented Sep 12, 2023

Purpose

We want to understand celery as an airflow executor, but we're not sure how to do that.

Given the above uncertainty, research is needed to inform future steps.

2 days of effort has been allocated and once complete, findings will be demonstrated and specific future actions will be decided.

Acceptance Criteria

[ACs should be clearly demo-able/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a local instance of airflow
    AND a collection of datasets to test on
    AND celery configured as the airflow executor
    WHEN the 2 days expire
    THEN demonstrate findings of using celery as an airflow executor
    AND provide potential questions for team members to ask at airflow summit.

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Sketch

  • launch local airflow instance with celery configured
  • run datasets through example ETL dag
  • identify additional datasets to expose strong suits/shortcomings of celery as executor (e.g. a large DCAT-US source)
  • record findings ( e.g. odd behaviors, failures, inconsistencies, successes )
  • compile list of questions for team members.
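
The first sketch step mostly comes down to three Airflow settings. Below is a minimal sketch of the environment a local instance would need, assuming a Redis broker and a Postgres result backend; the helper name and both URLs are hypothetical placeholders, not values from this project.

```python
from typing import Dict


def celery_executor_env(broker_url: str, result_backend: str) -> Dict[str, str]:
    """Build the environment variables that switch Airflow to the Celery executor."""
    return {
        "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
        "AIRFLOW__CELERY__BROKER_URL": broker_url,
        "AIRFLOW__CELERY__RESULT_BACKEND": result_backend,
    }


# Assumed local services; replace with whatever the local stack actually runs.
env = celery_executor_env(
    broker_url="redis://localhost:6379/0",
    result_backend="db+postgresql://airflow:airflow@localhost:5432/airflow",
)

for name, value in env.items():
    print(f"{name}={value}")
```

The same variables have to be visible to the scheduler, the webserver, and every worker, otherwise the components will disagree about where tasks are queued.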
@rshewitt rshewitt added the H2.0/Harvest-General General Harvesting 2.0 Issues label Sep 12, 2023
@rshewitt rshewitt self-assigned this Sep 12, 2023
@rshewitt rshewitt moved this to 🏗 In Progress [8] in data.gov team board Sep 13, 2023
@rshewitt
Contributor Author

Not related to celery, but the maximum list/dict length that xcom can push to trigger task mapping defaults to 1024 and is configurable via AIRFLOW__CORE__MAX_MAP_LENGTH ( source )
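
One possible way to stay under that limit for a large source is to map over batches of records instead of individual records. A minimal sketch in plain Python, assuming the default limit of 1024; `MAX_MAP_LENGTH` and `batches_for_mapping` are hypothetical names, not Airflow APIs.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Assumption: mirrors the default value of AIRFLOW__CORE__MAX_MAP_LENGTH.
MAX_MAP_LENGTH = 1024


def batches_for_mapping(
    items: Sequence[T], max_map_length: int = MAX_MAP_LENGTH
) -> List[List[T]]:
    """Split items into at most max_map_length batches of near-equal size."""
    if not items:
        return []
    # Smallest batch size that keeps the number of batches within the limit.
    size = math.ceil(len(items) / max_map_length)
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


records = list(range(5000))  # stand-in for harvested records
batches = batches_for_mapping(records)
print(len(batches), len(batches[0]))
```

Each mapped task would then process one batch, which trades per-record retry granularity for staying within the mapping limit.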

@rshewitt
Contributor Author

are we going to find ourselves in a position where state changes between when a task is requested and when it's executed?

@rshewitt
Contributor Author

celery workers prefetch tasks by default ( see worker_prefetch_multiplier ). throughput could suffer if a task has been prefetched by a busy worker and sits unprocessed while other workers are idle. the default visibility_timeout is an hour ( for the Redis transport ). decreasing that could pop the task from the busy worker and assign it to an available one?
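
To make the prefetch concern concrete: Celery reserves up to `worker_prefetch_multiplier * concurrency` messages per worker, and the multiplier defaults to 4 in Celery itself. The sketch below computes that number and shows the two Celery settings mentioned above as a plain dict; `reserved_per_worker` is a hypothetical helper, and how the dict is handed to Airflow's Celery config is left out as an assumption to verify.

```python
from typing import Any, Dict


def reserved_per_worker(concurrency: int, prefetch_multiplier: int = 4) -> int:
    """Upper bound on messages one worker holds at once (multiplier * concurrency).

    A multiplier of 0 means "no limit" in Celery and is not modelled here.
    """
    if concurrency < 1 or prefetch_multiplier < 1:
        raise ValueError("concurrency and prefetch_multiplier must be >= 1")
    return concurrency * prefetch_multiplier


# Celery setting names are real; the values are illustrative assumptions.
celery_settings: Dict[str, Any] = {
    # Reserve only as many tasks as there are pool slots.
    "worker_prefetch_multiplier": 1,
    # Seconds before an unacknowledged message is redelivered (Redis/SQS transports).
    "broker_transport_options": {"visibility_timeout": 3600},
}

print(reserved_per_worker(concurrency=8))     # Celery default multiplier
print(reserved_per_worker(concurrency=8, prefetch_multiplier=1))
```

Lowering the multiplier to 1 addresses the "prefetched but idle" case directly, without relying on the visibility timeout to move work between workers.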

@rshewitt
Contributor Author

rshewitt commented Sep 14, 2023

if a task takes longer to execute than the visibility timeout, it could be perpetually redelivered and never completed.

@rshewitt
Contributor Author

celery tasks don't have time limits by default. limits can be configured using task_soft_time_limit ( raises SoftTimeLimitExceeded inside the task ) and/or task_time_limit ( the process running the task is killed and replaced ).
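
The time limits and the visibility timeout interact, so it may help to check them together. A minimal sketch, assuming all values are in seconds and the one-hour default visibility timeout; `TimeoutPlan` is a hypothetical helper, while the keys it emits are Celery setting names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class TimeoutPlan:
    soft_time_limit_s: int
    time_limit_s: int
    visibility_timeout_s: int = 3600  # assumed default (one hour)

    def problems(self) -> List[str]:
        """Return human-readable warnings for inconsistent limits."""
        found = []
        if self.soft_time_limit_s >= self.time_limit_s:
            found.append("soft limit should be below the hard limit to allow cleanup")
        if self.time_limit_s >= self.visibility_timeout_s:
            found.append("task may outlive the visibility timeout and be redelivered")
        return found

    def as_celery_settings(self) -> Dict[str, Any]:
        """Express the plan using Celery's setting names."""
        return {
            "task_soft_time_limit": self.soft_time_limit_s,
            "task_time_limit": self.time_limit_s,
            "broker_transport_options": {
                "visibility_timeout": self.visibility_timeout_s
            },
        }


plan = TimeoutPlan(soft_time_limit_s=300, time_limit_s=360)
print(plan.problems())
print(plan.as_celery_settings())
```

Keeping the hard limit below the visibility timeout is one way to rule out the perpetual re-assignment scenario described in the previous comment.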

@rshewitt
Contributor Author

supported execution pools:

  • prefork ( default ) (CPU)
    • children are process-based.
  • solo (CPU)
    • the worker does the work instead of a pool of children. less oversight = faster processing?
  • eventlet (I/O)
    • children are greenlet-based "cooperatively scheduled threads".
  • gevent (I/O)
    • children are greenlet-based.

I/O tasks:

  • extract requires downloading something.
  • transform could require using an mdtranslator service.
  • load requires writing to an s3 bucket.

CPU tasks:

  • validation?
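
The I/O vs CPU split above suggests routing stages to workers with different pools. A rough sketch of that idea, assuming each harvest stage gets its own queue; the stage classification (including validation as CPU-bound) and all helper names are assumptions for illustration. The command uses the plain Celery CLI; under Airflow, workers are started through Airflow's own worker command and the pool is set in its Celery config.

```python
from typing import Dict, List

# Assumed classification, taken from the lists above; "validate" is a guess.
STAGE_KIND: Dict[str, str] = {
    "extract": "io",
    "transform": "io",
    "load": "io",
    "validate": "cpu",
}

# Greenlet pools for I/O-bound work, process pool for CPU-bound work.
POOL_FOR_KIND: Dict[str, str] = {"io": "gevent", "cpu": "prefork"}


def suggest_pool(stage: str) -> str:
    """Pick a Celery execution pool for a harvest stage."""
    return POOL_FOR_KIND[STAGE_KIND[stage]]


def worker_command(app: str, stage: str, concurrency: int) -> List[str]:
    """Build a Celery worker command that consumes only this stage's queue."""
    return [
        "celery", "-A", app, "worker",
        "--pool", suggest_pool(stage),
        "--concurrency", str(concurrency),
        "--queues", stage,
    ]


print(" ".join(worker_command("harvester", "extract", 100)))
print(" ".join(worker_command("harvester", "validate", 4)))
```

Greenlet pools can run at much higher concurrency than prefork because waiting on the network does not occupy a process, which is why the two example commands use very different concurrency values.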

@rshewitt rshewitt moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Sep 15, 2023
@hkdctol hkdctol closed this as completed Sep 15, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Sep 15, 2023
@btylerburton btylerburton added H2.0/orchestrator and removed H2.0/Harvest-General General Harvesting 2.0 Issues labels Dec 13, 2023
3 participants