Genie DAG import scripts #1201

Open — wants to merge 28 commits into base: master

Commits (28)
8023bae
Check-in Avery's scripts
callachennault Oct 31, 2024
c781efe
add portal scripts directory to clone_db_wrapper
callachennault Nov 12, 2024
e6b66d9
Add script for clickhouse import step
callachennault Nov 27, 2024
fbf5f4f
Clean up rest of genie dag scripts
callachennault Nov 27, 2024
6f53f88
Clean up scripts
callachennault Dec 3, 2024
97b67f3
change properties filepath, portal arg, comment out clickhouse import…
callachennault Jan 6, 2025
1b5051c
Fix bug in update process script call
callachennault Jan 8, 2025
f3a72f0
Improve comments, update derived tables script call, clean up code
callachennault Jan 9, 2025
f7fed0e
move variable definitions in import_genie after db color has been det…
callachennault Jan 9, 2025
bf0b87a
uncomment genie import steps, fix typo in check for number of studies…
callachennault Jan 9, 2025
25f1ef9
Change properties file and importer jar to prod files
callachennault Jan 15, 2025
46d5af3
clean up comments
callachennault Jan 15, 2025
a0b367e
Script for airflow task of setting import status
callachennault Jan 15, 2025
ff2aae5
Remove old comment and store script path in env var
callachennault Jan 15, 2025
356ee7b
Store automation-environment path in env var
callachennault Jan 15, 2025
6bd07a1
rename properties file
callachennault Jan 15, 2025
15fbdf1
Change portal column to genie-portal
callachennault Jan 15, 2025
b104a17
integrate derived table script updates
callachennault Jan 23, 2025
97d7647
Rename airflow genie scripts
callachennault Jan 23, 2025
ec0e0ea
remove notification file from import call
callachennault Jan 23, 2025
f87befa
Clean up logging
callachennault Jan 27, 2025
f5388e2
Accept properties file as an argument and remove set_update_process w…
callachennault Jan 28, 2025
39bb5b8
Make genie-airflow-import-clickhouse.sh executable
callachennault Jan 28, 2025
fc1f3b6
set tmp directory in clickhouse import script
callachennault Jan 28, 2025
b4243ca
add dag to PR
callachennault Jan 29, 2025
0e45da4
Commit version of refresh-cdd-oncotree-cache on genie import server
callachennault Jan 29, 2025
f567c48
Use updated slack functions in refresh-cdd-oncotree script
callachennault Jan 29, 2025
1c8899f
change back to test properties file to test on prod server
callachennault Jan 30, 2025
142 changes: 142 additions & 0 deletions dags/import_genie_dag.py
@@ -0,0 +1,142 @@
"""
import_genie_dag.py
Imports Genie study to MySQL and ClickHouse databases using blue/green deployment strategy.
"""
from datetime import timedelta, datetime
from airflow import DAG
from airflow.decorators import task
from airflow.exceptions import AirflowException
from airflow.models.param import Param
from airflow.providers.ssh.operators.ssh import SSHOperator
from airflow.utils.trigger_rule import TriggerRule


args = {
"owner": "airflow",
"depends_on_past": False,
"email": ["[email protected]"],
"email_on_failure": True,
"email_on_retry": False,
"retries": 0,
"retry_delay": timedelta(minutes=5),
}

"""
If any upstream tasks failed, this task will propagate the "Failed" status to the Dag Run.
"""
@task(trigger_rule=TriggerRule.ONE_FAILED, retries=0)
def watcher():
raise AirflowException("Failing task because one or more upstream tasks failed.")

with DAG(
dag_id="import_genie_dag",
default_args=args,
description="Imports Genie study to MySQL and ClickHouse databases using blue/green deployment strategy",
dagrun_timeout=timedelta(minutes=360),
max_active_runs=1,
start_date=datetime(2024, 12, 3),
schedule_interval=None,
tags=["genie"],
params={
"importer": Param("genie", type="string", title="Import Pipeline", description="Determines which importer to use. Must be one of: ['genie']"),
"data_repos": Param("genie,dmp", type="string", title="Data Repositories", description="Comma-separated list of data repositories to pull updates from/cleanup. Accepted values: ['genie', 'dmp']")
}
) as dag:

conn_id = "genie_importer_ssh"
import_scripts_path = "/data/portal-cron/scripts"
db_properties_filepath="/data/portal-cron/pipelines-credentials/manage_genie_database_update_tools.properties"

"""
Parses and validates DAG arguments
"""
@task
def parse_args(importer: str, data_repos: str):
to_use = []
if importer.strip() not in ['genie']:
raise TypeError('Required argument \'importer\' is incorrect or missing a value.')

ACCEPTED_DATA_REPOS = ["genie", "dmp"]
root_data_directory_path = "/data/portal-cron/cbio-portal-data"
for data_repo in data_repos.split(","):
if data_repo.strip() not in ACCEPTED_DATA_REPOS:
raise TypeError('Required argument \'data_repos\' is incorrect.')
to_use.append(root_data_directory_path + "/" + data_repo.strip())

data_repositories_to_use = ' '.join(to_use)
return data_repositories_to_use

datarepos = "{{ task_instance.xcom_pull(task_ids='parse_args') }}"

"""
Determines which database is "production" vs "not production"
Drops tables in the non-production MySQL database
Clones the production MySQL database into the non-production database
"""
clone_database = SSHOperator(
task_id="clone_database",
ssh_conn_id=conn_id,
command=f"{import_scripts_path}/genie-airflow-clone-db.sh {import_scripts_path} {db_properties_filepath}",
dag=dag,
)

"""
Does a db check for specified importer/pipeline
Fetches latest commit from GENIE repository
Refreshes CDD/Oncotree caches
"""
setup_import = SSHOperator(
task_id="setup_import",
ssh_conn_id=conn_id,
command=f"{import_scripts_path}/genie-airflow-setup.sh {{{{ params.importer }}}} {import_scripts_path} {db_properties_filepath}",
dag=dag,
)

"""
Imports cancer types
Imports genie-portal column in portal-configuration spreadsheet
"""
import_genie = SSHOperator(
task_id="import_genie",
ssh_conn_id=conn_id,
command=f"{import_scripts_path}/genie-airflow-import-sql.sh {{{{ params.importer }}}} {import_scripts_path} {db_properties_filepath}",
dag=dag,
)

"""
Drops ClickHouse tables
Copies MySQL tables to ClickHouse
Creates derived ClickHouse tables
"""
import_clickhouse = SSHOperator(
task_id="import_clickhouse",
ssh_conn_id=conn_id,
command=f"{import_scripts_path}/genie-airflow-import-clickhouse.sh {import_scripts_path} {db_properties_filepath}",
dag=dag,
)

"""
If any upstream tasks failed, mark the import attempt as abandoned.
"""
set_import_status = SSHOperator(
task_id="set_import_status",
ssh_conn_id=conn_id,
trigger_rule=TriggerRule.ONE_FAILED,
command=f"{import_scripts_path}/set_update_process_state.sh {db_properties_filepath} abandoned",
dag=dag,
)

"""
Removes untracked files/LFS objects from Genie repo.
"""
cleanup_genie = SSHOperator(
task_id="cleanup_genie",
ssh_conn_id=conn_id,
trigger_rule=TriggerRule.ALL_DONE,
command=f"{import_scripts_path}/datasource-repo-cleanup.sh {datarepos}",
dag=dag,
)

parsed_args = parse_args("{{ params.importer }}", "{{ params.data_repos }}")
parsed_args >> clone_database >> setup_import >> import_genie >> import_clickhouse >> set_import_status >> cleanup_genie
list(dag.tasks) >> watcher()
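The validation in the `parse_args` task can be exercised outside of Airflow. The following is a standalone sketch for illustration only (the constants and function here mirror the task body but are hypothetical extractions, not part of the PR):

```python
# Standalone sketch of the validation performed by the DAG's parse_args task.
ACCEPTED_DATA_REPOS = ["genie", "dmp"]
ROOT_DATA_DIRECTORY_PATH = "/data/portal-cron/cbio-portal-data"

def parse_args(importer: str, data_repos: str) -> str:
    """Validate the importer name and repo list; return space-joined repo paths."""
    if importer.strip() not in ["genie"]:
        raise TypeError("Required argument 'importer' is incorrect or missing a value.")
    to_use = []
    for data_repo in data_repos.split(","):
        if data_repo.strip() not in ACCEPTED_DATA_REPOS:
            raise TypeError("Required argument 'data_repos' is incorrect.")
        to_use.append(ROOT_DATA_DIRECTORY_PATH + "/" + data_repo.strip())
    return " ".join(to_use)

print(parse_args("genie", "genie,dmp"))
# → /data/portal-cron/cbio-portal-data/genie /data/portal-cron/cbio-portal-data/dmp
```

Note that the joined string is later consumed by `datasource-repo-cleanup.sh` as positional arguments, which is why the paths are space-separated rather than returned as a list.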
78 changes: 78 additions & 0 deletions import-scripts/genie-airflow-clone-db.sh
@@ -0,0 +1,78 @@
#!/bin/bash

# Script for running pre-import steps
# Consists of the following:
# - Determine which database is "production" vs "not production"
# - Drop tables in the non-production database
# - Clone the production database into the non-production database

PORTAL_SCRIPTS_DIRECTORY=$1
MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH=$2
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

# Create tmp directory for processing
tmp="$PORTAL_HOME/tmp/import-cron-genie"
if ! [ -d "$tmp" ] ; then
    if ! mkdir -p "$tmp" ; then
        echo "Error: could not create tmp directory '$tmp'" >&2
        exit 1
    fi
fi
if [[ -d "$tmp" && "$tmp" != "/" ]]; then
    rm -rf "$tmp"/*
fi

SET_UPDATE_PROCESS_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/set_update_process_state.sh"
GET_DB_IN_PROD_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/get_database_currently_in_production.sh"
DROP_TABLES_FROM_MYSQL_DATABASE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/drop_tables_in_mysql_database.sh"
CLONE_MYSQL_DATABASE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/clone_mysql_database.sh"

# Mark the update process as running in the process status database
if ! "$SET_UPDATE_PROCESS_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH" running ; then
    echo "Error during execution of $SET_UPDATE_PROCESS_SCRIPT_FILEPATH : could not set running state" >&2
    exit 1
fi

# Get the current production database color
current_production_database_color=$(sh "$GET_DB_IN_PROD_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH")
source_database_color="unset"
destination_database_color="unset"
if [ "${current_production_database_color:0:4}" == "blue" ] ; then
    source_database_color="blue"
    destination_database_color="green"
fi
if [ "${current_production_database_color:0:5}" == "green" ] ; then
    source_database_color="green"
    destination_database_color="blue"
fi
if [ "$destination_database_color" == "unset" ] ; then
    echo "Error during determination of the destination database color" >&2
    exit 1
fi

echo "Source DB color: $source_database_color"
echo "Destination DB color: $destination_database_color"

# Drop tables in the non-production database to make space for cloning
echo "dropping tables from mysql database $destination_database_color"
if ! "$DROP_TABLES_FROM_MYSQL_DATABASE_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH" "$destination_database_color" ; then
    echo "Error during dropping of tables from mysql database $destination_database_color" >&2
    exit 1
else
    # Clone the content of the production MySQL database into the non-production database
    echo "copying tables from mysql database $source_database_color to $destination_database_color"
    if ! "$CLONE_MYSQL_DATABASE_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH" "$source_database_color" "$destination_database_color" ; then
        echo "Error during cloning the mysql database (from $source_database_color to $destination_database_color)" >&2
        exit 1
    fi
fi
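All three shell scripts repeat the same color-flip rule: the import always targets whichever color is not currently serving production (the `${var:0:4}`/`${var:0:5}` prefix checks). The same logic, written as a hypothetical Python helper purely for illustration:

```python
def destination_color(current_production_color: str) -> str:
    """Mirror of the shell scripts' prefix checks: imports always target
    the color that is NOT currently serving production."""
    if current_production_color.startswith("blue"):
        return "green"
    if current_production_color.startswith("green"):
        return "blue"
    # Matches the scripts' behavior of failing when neither prefix matches
    raise ValueError("Error during determination of the destination database color")
```

Because only the prefix is inspected, outputs such as `blue_db` or `green-2` from `get_database_currently_in_production.sh` would also match, which appears to be the intent of the substring checks.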
84 changes: 84 additions & 0 deletions import-scripts/genie-airflow-import-clickhouse.sh
@@ -0,0 +1,84 @@
#!/bin/bash

# Script for updating the ClickHouse DB
# Consists of the following:
# - Drop ClickHouse tables
# - Copy MySQL tables to ClickHouse
# - Create derived ClickHouse tables

PORTAL_SCRIPTS_DIRECTORY=$1
MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH=$2
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

tmp="$PORTAL_HOME/tmp/import-cron-genie"
GET_DB_IN_PROD_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/get_database_currently_in_production.sh"
DROP_TABLES_FROM_CLICKHOUSE_DATABASE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/drop_tables_in_clickhouse_database.sh"
COPY_TABLES_FROM_MYSQL_TO_CLICKHOUSE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/copy_mysql_database_tables_to_clickhouse.sh"
DOWNLOAD_DERIVED_TABLE_SQL_FILES_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/download_clickhouse_sql_scripts_py3.py"
CREATE_DERIVED_TABLES_IN_CLICKHOUSE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/create_derived_tables_in_clickhouse_database.sh"

# Get the current production database color
current_production_database_color=$(sh "$GET_DB_IN_PROD_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH")
destination_database_color="unset"
if [ "${current_production_database_color:0:4}" == "blue" ] ; then
    destination_database_color="green"
fi
if [ "${current_production_database_color:0:5}" == "green" ] ; then
    destination_database_color="blue"
fi
if [ "$destination_database_color" == "unset" ] ; then
    echo "Error during determination of the destination database color" >&2
    exit 1
fi

echo "Destination DB color: $destination_database_color"

# Drop tables from the non-production ClickHouse DB to make room for the incoming copy
echo "dropping tables from clickhouse database $destination_database_color to make room for incoming copy"
if ! "$DROP_TABLES_FROM_CLICKHOUSE_DATABASE_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH" "$destination_database_color" ; then
    echo "Error during dropping of tables from clickhouse database $destination_database_color" >&2
    exit 1
fi

# Use Sling to copy data from the non-production MySQL DB to the non-production ClickHouse DB
echo "copying tables from mysql database $destination_database_color to clickhouse database $destination_database_color"
if ! "$COPY_TABLES_FROM_MYSQL_TO_CLICKHOUSE_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH" "$destination_database_color" ; then
    echo "Error during copying of tables from mysql database $destination_database_color to clickhouse database $destination_database_color" >&2
    exit 1
fi

# Create the derived table sql script dirpath if it does not already exist
derived_table_sql_script_dirpath="$tmp/create_derived_clickhouse_tables"
if ! [ -e "$derived_table_sql_script_dirpath" ] ; then
    if ! mkdir -p "$derived_table_sql_script_dirpath" ; then
        echo "Error: could not create target directory '$derived_table_sql_script_dirpath'" >&2
        exit 1
    fi
fi

# Remove any scripts currently in the derived table sql script dirpath
if [[ -d "$derived_table_sql_script_dirpath" && "$derived_table_sql_script_dirpath" != "/" ]]; then
    rm -rf "$derived_table_sql_script_dirpath"/*
fi

# Attempt to download the derived table SQL files from GitHub
if ! "$DOWNLOAD_DERIVED_TABLE_SQL_FILES_SCRIPT_FILEPATH" "$derived_table_sql_script_dirpath" ; then
    echo "Error during download of derived table construction .sql files from github" >&2
    exit 1
fi

# Create the additional derived tables inside the non-production ClickHouse DB
echo "creating derived tables in clickhouse database $destination_database_color"
if ! "$CREATE_DERIVED_TABLES_IN_CLICKHOUSE_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH" "$destination_database_color" "$derived_table_sql_script_dirpath"/* ; then
    echo "Error during derivation of clickhouse tables in clickhouse database $destination_database_color" >&2
    exit 1
fi
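One subtlety in the final step: passing `"$derived_table_sql_script_dirpath"/*` relies on bash pathname expansion, which sorts matches lexicographically, so the naming of the downloaded .sql files determines the order in which derived tables are created. A hypothetical Python equivalent of that expansion, for illustration only:

```python
import glob
import os

def ordered_sql_files(dirpath: str) -> list:
    """Mimic the shell glob "$dir"/*: bash expands the pattern in
    lexicographic order, so file names control apply order."""
    return sorted(glob.glob(os.path.join(dirpath, "*")))
```

If any derived-table script depends on tables created by another, prefixing file names with a sortable sequence number (e.g. `01_`, `02_`) would make that ordering explicit.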
69 changes: 69 additions & 0 deletions import-scripts/genie-airflow-import-sql.sh
@@ -0,0 +1,69 @@
#!/bin/bash

# Script for running the GENIE import
# Consists of the following:
# - Import of cancer types
# - Import from the genie-portal column in the spreadsheet

IMPORTER=$1
PORTAL_SCRIPTS_DIRECTORY=$2
MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH=$3
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

# Get the current production database color
GET_DB_IN_PROD_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/get_database_currently_in_production.sh"
current_production_database_color=$(sh "$GET_DB_IN_PROD_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH")
destination_database_color="unset"
if [ "${current_production_database_color:0:4}" == "blue" ] ; then
    destination_database_color="green"
fi
if [ "${current_production_database_color:0:5}" == "green" ] ; then
    destination_database_color="blue"
fi
if [ "$destination_database_color" == "unset" ] ; then
    echo "Error during determination of the destination database color" >&2
    exit 1
fi

tmp="$PORTAL_HOME/tmp/import-cron-genie"
IMPORTER_JAR_FILENAME="/data/portal-cron/lib/$IMPORTER-aws-importer-$destination_database_color.jar"
JAVA_IMPORTER_ARGS="$JAVA_SSL_ARGS -Dspring.profiles.active=dbcp -Djava.io.tmpdir=$tmp -ea -cp $IMPORTER_JAR_FILENAME org.mskcc.cbio.importer.Admin"
ONCOTREE_VERSION="oncotree_2019_12_01"

# Direct importer logs to stdout; kill the background tail when this script exits
tail -f "$PORTAL_HOME/logs/genie-aws-importer.log" &
TAIL_PID=$!
trap 'kill "$TAIL_PID" 2>/dev/null' EXIT

echo "Destination DB color: $destination_database_color"
echo "Importing with $IMPORTER_JAR_FILENAME"

echo "Importing cancer type updates into mysql database $destination_database_color"
if ! $JAVA_BINARY -Xmx16g $JAVA_IMPORTER_ARGS --import-types-of-cancer --oncotree-version $ONCOTREE_VERSION ; then
    echo "Error: Cancer type import failed!" >&2
    exit 1
fi

echo "Importing genie study data into mysql database $destination_database_color"
if ! $JAVA_BINARY -Xmx64g $JAVA_IMPORTER_ARGS --update-study-data --portal genie-portal --update-worksheet --oncotree-version $ONCOTREE_VERSION --transcript-overrides-source mskcc --disable-redcap-export ; then
    echo "Error: Genie import failed!" >&2
    exit 1
fi

num_studies_updated=''
num_studies_updated_filename="$tmp/num_studies_updated.txt"
if [ -r "$num_studies_updated_filename" ] ; then
    num_studies_updated=$(cat "$num_studies_updated_filename")
fi
if [[ -z $num_studies_updated ]] || [[ $num_studies_updated == "0" ]] ; then
    echo "Error: No studies updated, either due to error or failure to mark a study in the spreadsheet" >&2
    exit 1
fi
echo "Number of studies updated: $num_studies_updated"