Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genie DAG import scripts #1201

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

callachennault
Copy link
Contributor

@callachennault callachennault commented Oct 31, 2024

This PR adds scripts for the Genie import DAG:

  • dags/import_genie_dag.py: DAG which handles the genie import

  • genie-airflow-clone-db.sh: Drop tables in non-prod SQL DB, clone prod DB

  • genie-airflow-setup.sh: DB check, repo fetch, CDD/Oncotree cache refresh

  • genie-airflow-import-sql.sh: Import cancer types into SQL, import from genie-portal column into SQL DB

  • genie-airflow-import-clickhouse.sh: Import into Clickhouse DB

  • updates to refresh-cdd-oncotree-cache.sh to use correct URLs for CDD and OncoTree

  • DISCUSS: Should these scripts be moved to the same repository as the genie import DAG? Should the genie import DAG be moved to this repo? - Discussed with Avery 1/29: will store import_genie_dag in cmo-pipelines

@callachennault callachennault force-pushed the genie-dag-import-scripts branch from 50ceb72 to c2f5996 Compare November 27, 2024 21:23
@callachennault callachennault force-pushed the genie-dag-import-scripts branch from c2f5996 to fbf5f4f Compare November 27, 2024 21:24
@callachennault callachennault reopened this Jan 9, 2025
Copy link
Contributor

@sheridancbio sheridancbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review covers first file only, but most comments apply to all scripts
additional scripts will be reviewed in a second review.

import-scripts/clone_db_wrapper.sh Outdated Show resolved Hide resolved
import-scripts/clone_db_wrapper.sh Outdated Show resolved Hide resolved
import-scripts/clone_db_wrapper.sh Outdated Show resolved Hide resolved
import-scripts/clone_db_wrapper.sh Outdated Show resolved Hide resolved
import-scripts/clone_db_wrapper.sh Outdated Show resolved Hide resolved
import-scripts/clone_db_wrapper.sh Outdated Show resolved Hide resolved
Copy link
Contributor

@sheridancbio sheridancbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additional issues found in remaining files (part 2 of review)

import-scripts/import_clickhouse.sh Outdated Show resolved Hide resolved
import-scripts/import_clickhouse.sh Outdated Show resolved Hide resolved
import-scripts/import_clickhouse.sh Outdated Show resolved Hide resolved
import-scripts/import_genie.sh Outdated Show resolved Hide resolved
import-scripts/set_import_status.sh Outdated Show resolved Hide resolved
import-scripts/import_genie.sh Outdated Show resolved Hide resolved
import-scripts/import_genie.sh Outdated Show resolved Hide resolved
import-scripts/import_genie.sh Outdated Show resolved Hide resolved
import-scripts/import_genie.sh Outdated Show resolved Hide resolved
import-scripts/import_genie.sh Outdated Show resolved Hide resolved
Copy link
Collaborator

@averyniceday averyniceday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good for first release --

couple comments regarding logging messages, mostly just clarifying Source/Dest db.

For future work/improvements, need to make sure server is inline with what the scripts expect (esp. in this SSHOperator mode of running)

Also suggest (maybe this is already in the DAG repo) -- adding a README.md for the DAG workflow and steps. What is "wrapper" logic versus tooling logic and expected frequency of updates/tweaks.

@callachennault
Copy link
Contributor Author

thanks @averyniceday, responding below:

For future work/improvements, need to make sure server is inline with what the scripts expect (esp. in this SSHOperator mode of running)

Can you elaborate on what you mean by this?

Also suggest (maybe this is already in the DAG repo) -- adding a README.md for the DAG workflow and steps. What is "wrapper" logic versus tooling logic and expected frequency of updates/tweaks.

I'd prefer not to document the DAG steps because Airflow already generates a DAG diagram which will automatically reflect any changes to the tasks/task order. The tasks are also documented in the DAG file and in the scripts themselves. What do you mean by tooling logic?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants