Genie DAG import scripts #1201

Open
wants to merge 28 commits into
base: master
Changes from 20 commits
28 commits
8023bae
Check-in Avery's scripts
callachennault Oct 31, 2024
c781efe
add portal scripts directory to clone_db_wrapper
callachennault Nov 12, 2024
e6b66d9
Add script for clickhouse import step
callachennault Nov 27, 2024
fbf5f4f
Clean up rest of genie dag scripts
callachennault Nov 27, 2024
6f53f88
Clean up scripts
callachennault Dec 3, 2024
97b67f3
change properties filepath, portal arg, comment out clickhouse import…
callachennault Jan 6, 2025
1b5051c
Fix bug in update process script call
callachennault Jan 8, 2025
f3a72f0
Improve comments, update derived tables script call, clean up code
callachennault Jan 9, 2025
f7fed0e
move variable definitions in import_genie after db color has been det…
callachennault Jan 9, 2025
bf0b87a
uncomment genie import steps, fix typo in check for number of studies…
callachennault Jan 9, 2025
25f1ef9
Change properties file and importer jar to prod files
callachennault Jan 15, 2025
46d5af3
clean up comments
callachennault Jan 15, 2025
a0b367e
Script for airflow task of setting import status
callachennault Jan 15, 2025
ff2aae5
Remove old comment and store script path in env var
callachennault Jan 15, 2025
356ee7b
Store automation-environment path in env var
callachennault Jan 15, 2025
6bd07a1
rename properties file
callachennault Jan 15, 2025
15fbdf1
Change portal column to genie-portal
callachennault Jan 15, 2025
b104a17
integrate derived table script updates
callachennault Jan 23, 2025
97d7647
Rename airflow genie scripts
callachennault Jan 23, 2025
ec0e0ea
remove notification file from import call
callachennault Jan 23, 2025
f87befa
Clean up logging
callachennault Jan 27, 2025
f5388e2
Accept properties file as an argument and remove set_update_process w…
callachennault Jan 28, 2025
39bb5b8
Make genie-airflow-import-clickhouse.sh executable
callachennault Jan 28, 2025
fc1f3b6
set tmp directory in clickhouse import script
callachennault Jan 28, 2025
b4243ca
add dag to PR
callachennault Jan 29, 2025
0e45da4
Commit version of refresh-cdd-oncotree-cache on genie import server
callachennault Jan 29, 2025
f567c48
Use updated slack functions in refresh-cdd-oncotree script
callachennault Jan 29, 2025
1c8899f
change back to test properties file to test on prod server
callachennault Jan 30, 2025
74 changes: 74 additions & 0 deletions import-scripts/genie-airflow-clone-db.sh
@@ -0,0 +1,74 @@
#!/bin/bash

# Script for running pre-import steps
# Consists of the following:
# - Determine which database is "production" vs "not production"
# - Drop tables in the non-production database
# - Clone the production database into the non-production database

PORTAL_SCRIPTS_DIRECTORY=$1
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

# Create tmp directory for processing
tmp=$PORTAL_HOME/tmp/import-cron-genie
if ! [ -d "$tmp" ] ; then
    if ! mkdir -p "$tmp" ; then
        echo "Error : could not create tmp directory '$tmp'" >&2
        exit 1
    fi
fi
if [[ -d "$tmp" && "$tmp" != "/" ]]; then
    rm -rf "$tmp"/*
fi

MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/pipelines-credentials/manage_genie_database_update_tools.properties"
SET_UPDATE_PROCESS_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/set_update_process_state.sh"
GET_DB_IN_PROD_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/get_database_currently_in_production.sh"
DROP_TABLES_FROM_MYSQL_DATABASE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/drop_tables_in_mysql_database.sh"
CLONE_MYSQL_DATABASE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/clone_mysql_database.sh"

# Update the process status database
if ! $SET_UPDATE_PROCESS_SCRIPT_FILEPATH $MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH running ; then
    echo "Error during execution of $SET_UPDATE_PROCESS_SCRIPT_FILEPATH : could not set running state" >&2
    exit 1
fi

# Get the current production database color
current_production_database_color=$(sh "$GET_DB_IN_PROD_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH")
source_database_color="unset"
destination_database_color="unset"
# Quote the expansions so an empty result does not cause a test syntax error
if [ "${current_production_database_color:0:4}" == "blue" ] ; then
    source_database_color="blue"
    destination_database_color="green"
fi
if [ "${current_production_database_color:0:5}" == "green" ] ; then
    source_database_color="green"
    destination_database_color="blue"
fi
if [ "$destination_database_color" == "unset" ] ; then
    echo "Error during determination of the destination database color" >&2
    exit 1
fi

echo "we are going to clone $source_database_color into $destination_database_color and then import into $destination_database_color"
# Drop tables in the non-production database to make space for cloning
if ! $DROP_TABLES_FROM_MYSQL_DATABASE_SCRIPT_FILEPATH $MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH $destination_database_color ; then
    message="Error during dropping of tables from mysql database $destination_database_color"
    echo "$message" >&2
    exit 1
else
    # Clone the content of the production MySQL database into the non-production database
    if ! $CLONE_MYSQL_DATABASE_SCRIPT_FILEPATH $MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH $source_database_color $destination_database_color ; then
        message="Error during cloning the mysql database (from $source_database_color to $destination_database_color)"
        echo "$message" >&2
        exit 1
    fi
fi
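The blue/green destination logic in this script is repeated in the other genie-airflow scripts in this PR. A minimal sketch of how it could be factored into one function follows. The function name `get_destination_database_color` is hypothetical and not part of this PR, and the literal `"blue"` stands in for the output of `get_database_currently_in_production.sh`.

```shell
# Hypothetical helper (not part of this PR): map the color currently in
# production to the color that should receive the clone and import.
# Prefix matching mirrors the ${var:0:4} / ${var:0:5} checks in the scripts.
get_destination_database_color() {
    local production_color="$1"
    case "$production_color" in
        blue*)  echo "green" ;;
        green*) echo "blue" ;;
        *)      return 1 ;;   # empty or unrecognized value
    esac
}

# Example usage with a stand-in value for the production color
if destination_database_color=$(get_destination_database_color "blue"); then
    echo "destination database color is $destination_database_color"
else
    echo "Error during determination of the destination database color" >&2
fi
```

A `case` statement handles an empty value without any quoting pitfalls, and a non-zero return replaces the `"unset"` sentinel.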
81 changes: 81 additions & 0 deletions import-scripts/genie-airflow-import-clickhouse.sh
@@ -0,0 +1,81 @@
#!/bin/bash

# Script for updating ClickHouse DB
# Consists of the following:
# - Drop ClickHouse tables
# - Copy MySQL tables to ClickHouse
# - Create derived ClickHouse tables

PORTAL_SCRIPTS_DIRECTORY=$1
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/pipelines-credentials/manage_genie_database_update_tools.properties"
GET_DB_IN_PROD_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/get_database_currently_in_production.sh"
DROP_TABLES_FROM_CLICKHOUSE_DATABASE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/drop_tables_in_clickhouse_database.sh"
COPY_TABLES_FROM_MYSQL_TO_CLICKHOUSE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/copy_mysql_database_tables_to_clickhouse.sh"
DOWNLOAD_DERVIED_TABLE_SQL_FILES_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/download_clickhouse_sql_scripts_py3.py"
CREATE_DERIVED_TABLES_IN_CLICKHOUSE_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/create_derived_tables_in_clickhouse_database.sh"

# Get the current production database color
current_production_database_color=$(sh "$GET_DB_IN_PROD_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH")
destination_database_color="unset"
# Quote the expansions so an empty result does not cause a test syntax error
if [ "${current_production_database_color:0:4}" == "blue" ] ; then
    destination_database_color="green"
fi
if [ "${current_production_database_color:0:5}" == "green" ] ; then
    destination_database_color="blue"
fi
if [ "$destination_database_color" == "unset" ] ; then
    echo "Error during determination of the destination database color" >&2
    exit 1
fi

# Drop tables from non-production ClickHouse DB to make room for incoming copy
echo "dropping tables from clickhouse database $destination_database_color..."
if ! $DROP_TABLES_FROM_CLICKHOUSE_DATABASE_SCRIPT_FILEPATH $MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH $destination_database_color ; then
    echo "Error during dropping of tables from clickhouse database $destination_database_color" >&2
    exit 1
fi

# Use Sling to copy data from non-production MySQL DB to non-production ClickHouse DB
echo "copying tables from mysql database $destination_database_color to clickhouse database $destination_database_color..."
if ! $COPY_TABLES_FROM_MYSQL_TO_CLICKHOUSE_SCRIPT_FILEPATH $MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH $destination_database_color ; then
    echo "Error during copying of tables from mysql database $destination_database_color to clickhouse database $destination_database_color" >&2
    exit 1
fi

# Check if derived table sql script dirpath exists
# If not, try to create it
# tmp must be set before use; this matches the tmp directory used by the other genie scripts
tmp=$PORTAL_HOME/tmp/import-cron-genie
derived_table_sql_script_dirpath="$tmp/create_derived_clickhouse_tables"
if ! [ -e "$derived_table_sql_script_dirpath" ] ; then
    if ! mkdir -p "$derived_table_sql_script_dirpath" ; then
        echo "Error : could not create target directory '$derived_table_sql_script_dirpath'" >&2
        exit 1
    fi
fi

# Remove any scripts currently in the derived table sql script dirpath
if [[ -d "$derived_table_sql_script_dirpath" && "$derived_table_sql_script_dirpath" != "/" ]]; then
    rm -rf "$derived_table_sql_script_dirpath"/*
fi

# Attempt to download the derived table SQL files from github
if ! $DOWNLOAD_DERVIED_TABLE_SQL_FILES_SCRIPT_FILEPATH "$derived_table_sql_script_dirpath" ; then
    echo "Error : could not download needed derived table construction .sql files from github" >&2
    exit 1
fi

# Create the additional derived tables inside of non-production Clickhouse DB
echo "creating derived tables in clickhouse database $destination_database_color..."
if ! $CREATE_DERIVED_TABLES_IN_CLICKHOUSE_SCRIPT_FILEPATH $MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH $destination_database_color "$derived_table_sql_script_dirpath"/* ; then
    echo "Error during derivation of clickhouse tables in clickhouse database $destination_database_color" >&2
    exit 1
fi
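Both this script and genie-airflow-clone-db.sh create a working directory and then empty it with `rm -rf "$dir"/*`. The sketch below shows one way to combine those steps behind a guard against an empty path or `/`. The function name `prepare_scratch_directory` is hypothetical, and the demo uses a `mktemp` directory rather than the real `$tmp` path.

```shell
# Hypothetical helper (not part of this PR): create a scratch directory if
# needed and remove any files left in it from a previous run.
# Refuses to operate on an empty path or on "/".
prepare_scratch_directory() {
    local dirpath="$1"
    if [ -z "$dirpath" ] || [ "$dirpath" = "/" ]; then
        echo "Error : refusing to prepare scratch directory '$dirpath'" >&2
        return 1
    fi
    if ! mkdir -p "$dirpath"; then
        echo "Error : could not create directory '$dirpath'" >&2
        return 1
    fi
    # ${dirpath:?} makes the expansion fail instead of expanding to "/*"
    rm -rf "${dirpath:?}"/*
}

# Example usage against a throwaway location
demo_dir="$(mktemp -d)/create_derived_clickhouse_tables"
if prepare_scratch_directory "$demo_dir"; then
    echo "scratch directory ready: $demo_dir"
fi
```

As in the original scripts, the glob does not match dotfiles, so hidden files in the directory would be left in place.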
68 changes: 68 additions & 0 deletions import-scripts/genie-airflow-import-sql.sh
@@ -0,0 +1,68 @@
#!/bin/bash

# Script for running GENIE import
# Consists of the following:
# - Import of cancer types
# - Import from genie-portal column in spreadsheet

IMPORTER=$1
PORTAL_SCRIPTS_DIRECTORY=$2
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

# Get the current production database color
MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/pipelines-credentials/manage_genie_database_update_tools.properties"
GET_DB_IN_PROD_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/get_database_currently_in_production.sh"
current_production_database_color=$(sh "$GET_DB_IN_PROD_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH")
destination_database_color="unset"
# Quote the expansions so an empty result does not cause a test syntax error
if [ "${current_production_database_color:0:4}" == "blue" ] ; then
    destination_database_color="green"
fi
if [ "${current_production_database_color:0:5}" == "green" ] ; then
    destination_database_color="blue"
fi
if [ "$destination_database_color" == "unset" ] ; then
    echo "Error during determination of the destination database color" >&2
    exit 1
fi

tmp=$PORTAL_HOME/tmp/import-cron-genie
IMPORTER_JAR_FILENAME="/data/portal-cron/lib/$IMPORTER-aws-importer-$destination_database_color.jar"
JAVA_IMPORTER_ARGS="$JAVA_SSL_ARGS -Dspring.profiles.active=dbcp -Djava.io.tmpdir=$tmp -ea -cp $IMPORTER_JAR_FILENAME org.mskcc.cbio.importer.Admin"
ONCOTREE_VERSION="oncotree_2019_12_01"

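# Direct importer logs to stdout; stop the background tail when this script exits
tail -f "$PORTAL_HOME/logs/genie-aws-importer.log" &
tail_pid=$!
trap 'kill "$tail_pid" 2>/dev/null' EXIT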

echo "Importing with $IMPORTER_JAR_FILENAME"
echo "Importing cancer type updates into genie portal database..."
$JAVA_BINARY -Xmx16g $JAVA_IMPORTER_ARGS --import-types-of-cancer --oncotree-version $ONCOTREE_VERSION
if [ $? -gt 0 ]; then
    echo "Cancer type import failed!" >&2
    exit 1
fi

echo "Importing study data into genie portal database..."
$JAVA_BINARY -Xmx64g $JAVA_IMPORTER_ARGS --update-study-data --portal genie-portal --update-worksheet --oncotree-version $ONCOTREE_VERSION --transcript-overrides-source mskcc --disable-redcap-export
if [ $? -gt 0 ]; then
    echo "Genie import failed!" >&2
    exit 1
fi

num_studies_updated=''
num_studies_updated_filename="$tmp/num_studies_updated.txt"
if [ -r "$num_studies_updated_filename" ] ; then
    num_studies_updated=$(cat "$num_studies_updated_filename")
fi
if [[ -z $num_studies_updated ]] || [[ $num_studies_updated == "0" ]] ; then
    echo "No studies updated, either due to error or failure to mark a study in the spreadsheet" >&2
    exit 1
fi
echo "$num_studies_updated studies were updated"
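The final check treats an empty or `"0"` value in `num_studies_updated.txt` as a failure, but any other content passes. A stricter variant that accepts only a positive integer might look like the sketch below. The function name `read_num_studies_updated` is hypothetical, and the demo writes its own temporary file instead of reading the importer's output.

```shell
# Hypothetical helper (not part of this PR): print the study count stored in
# the given file, and succeed only if it is a positive integer.
read_num_studies_updated() {
    local filename="$1"
    local count=""
    if [ -r "$filename" ]; then
        # Strip whitespace such as a trailing newline
        count=$(tr -d '[:space:]' < "$filename")
    fi
    case "$count" in
        ""|*[!0-9]*) return 1 ;;   # missing file, empty value, or non-numeric
    esac
    if [ "$count" -eq 0 ]; then
        return 1
    fi
    echo "$count"
}

# Example usage with a stand-in file
demo_file="$(mktemp)"
echo "2" > "$demo_file"
if num_studies_updated=$(read_num_studies_updated "$demo_file"); then
    echo "$num_studies_updated studies were updated"
else
    echo "No studies updated" >&2
fi
```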
20 changes: 20 additions & 0 deletions import-scripts/genie-airflow-set-import-status.sh
@@ -0,0 +1,20 @@
#!/bin/bash

# Script for the Airflow task that sets the import status

IMPORT_STATUS=$1
PORTAL_SCRIPTS_DIRECTORY=$2
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

SET_UPDATE_PROCESS_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/set_update_process_state.sh"
MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/pipelines-credentials/manage_genie_database_update_tools.properties"

# Update import status
if ! $SET_UPDATE_PROCESS_SCRIPT_FILEPATH $MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH $IMPORT_STATUS ; then
    echo "Error during execution of $SET_UPDATE_PROCESS_SCRIPT_FILEPATH : could not set $IMPORT_STATUS state" >&2
    exit 1
fi
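This script passes `$IMPORT_STATUS` to `set_update_process_state.sh` without checking it, so a missing or mistyped argument is only caught downstream. One possible up-front check is sketched below. The function name `is_valid_import_status` is hypothetical, and the accepted values are an assumption: only `running` appears in this PR, while `complete` and `abandoned` are placeholders to be replaced with whatever `set_update_process_state.sh` actually accepts.

```shell
# Hypothetical helper (not part of this PR): accept only known status values.
# ASSUMPTION: "running" is taken from genie-airflow-clone-db.sh; "complete"
# and "abandoned" are illustrative placeholders.
is_valid_import_status() {
    case "$1" in
        running|complete|abandoned) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage with a stand-in argument
IMPORT_STATUS="running"
if is_valid_import_status "$IMPORT_STATUS"; then
    echo "setting import status to $IMPORT_STATUS"
else
    echo "Error : unrecognized import status '$IMPORT_STATUS'" >&2
fi
```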
64 changes: 64 additions & 0 deletions import-scripts/genie-airflow-setup.sh
@@ -0,0 +1,64 @@
#!/bin/bash

# Script for running pre-import steps
# Consists of the following:
# - Database check (given a specific importer)
# - Data fetch from provided sources
# - Refreshing CDD/Oncotree caches

IMPORTER=$1
PORTAL_SCRIPTS_DIRECTORY=$2
if [ -z "$PORTAL_SCRIPTS_DIRECTORY" ]; then
    PORTAL_SCRIPTS_DIRECTORY="/data/portal-cron/scripts"
fi
AUTOMATION_ENV_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/automation-environment.sh"
if [ ! -f "$AUTOMATION_ENV_SCRIPT_FILEPATH" ] ; then
    echo "$(date): Unable to locate $AUTOMATION_ENV_SCRIPT_FILEPATH, exiting..."
    exit 1
fi
source "$AUTOMATION_ENV_SCRIPT_FILEPATH"

# Get the current production database color
MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/pipelines-credentials/manage_genie_database_update_tools.properties"
GET_DB_IN_PROD_SCRIPT_FILEPATH="$PORTAL_SCRIPTS_DIRECTORY/get_database_currently_in_production.sh"
current_production_database_color=$(sh "$GET_DB_IN_PROD_SCRIPT_FILEPATH" "$MANAGE_DATABASE_TOOL_PROPERTIES_FILEPATH")
destination_database_color="unset"
# Quote the expansions so an empty result does not cause a test syntax error
if [ "${current_production_database_color:0:4}" == "blue" ] ; then
    destination_database_color="green"
fi
if [ "${current_production_database_color:0:5}" == "green" ] ; then
    destination_database_color="blue"
fi
if [ "$destination_database_color" == "unset" ] ; then
    echo "Error during determination of the destination database color" >&2
    exit 1
fi

tmp=$PORTAL_HOME/tmp/import-cron-genie
IMPORTER_JAR_FILENAME="/data/portal-cron/lib/$IMPORTER-aws-importer-$destination_database_color.jar"
echo "Checking using $IMPORTER_JAR_FILENAME"
JAVA_IMPORTER_ARGS="$JAVA_SSL_ARGS -Dspring.profiles.active=dbcp -Djava.io.tmpdir=$tmp -ea -cp $IMPORTER_JAR_FILENAME org.mskcc.cbio.importer.Admin"

# Database check
echo "Checking if database version is compatible"
$JAVA_BINARY $JAVA_IMPORTER_ARGS --check-db-version
if [ $? -gt 0 ]; then
    echo "Database version expected by portal does not match version in database!" >&2
    exit 1
fi

# Fetch updates in genie repository
echo "Fetching updates from genie..."
$JAVA_BINARY $JAVA_IMPORTER_ARGS --fetch-data --data-source genie --run-date latest
if [ $? -gt 0 ]; then
    echo "GENIE fetch failed!" >&2
    exit 1
fi

# Refresh CDD/Oncotree cache to pull latest metadata
echo "Refreshing CDD/ONCOTREE caches..."
bash $PORTAL_SCRIPTS_DIRECTORY/refresh-cdd-oncotree-cache.sh
if [ $? -gt 0 ]; then
    echo "Failed to refresh CDD and/or ONCOTREE cache during GENIE import!" >&2
    exit 1
fi
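Each step in this script follows the same shape: run a command, test `$?`, print a message to stderr, and exit. A small wrapper could express that once, as in the sketch below. The function name `run_step` is hypothetical, and `true` and `false` stand in for the real importer invocations.

```shell
# Hypothetical helper (not part of this PR): run a command and, if it fails,
# print the given message to stderr and return the command's exit status.
run_step() {
    local failure_message="$1"
    shift
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "$failure_message" >&2
    fi
    return "$rc"
}

# Example usage with stand-in commands
run_step "Database version check failed!" true && echo "check passed"
run_step "GENIE fetch failed!" false 2>/dev/null || echo "fetch step reported failure"
```

In the real script a caller would write something like `run_step "GENIE fetch failed!" <command> || exit 1`, which keeps the error message next to the command that produces it.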