Running a Whole Genome Pedigree Dataset (NIH Biowulf)

Charles Markello edited this page Jul 13, 2020 · 21 revisions

Setup Instructions

Set up the main working directory

cd into a directory with at least 2 TB of allocated disk space:

cd /data/$USER
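
To confirm the directory really has that much room before downloading anything, a check along these lines may help. This is a minimal sketch, assuming a POSIX `df`; the helper name `has_free_kb` is hypothetical, and on a quota-managed filesystem the number `df` reports may not match your actual allocation.

```shell
# has_free_kb DIR MIN_KB: succeed if DIR has at least MIN_KB kilobytes available.
has_free_kb() {
    local dir="$1" min_kb="$2" avail_kb
    # -P forces one line per filesystem, -k reports 1024-byte blocks;
    # the fourth column of the data line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}
```

For example, `has_free_kb /data/$USER 2147483648 || echo "less than 2 TB available"` tests for 2 TiB expressed in KiB.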

Launch an interactive session on Biowulf and load requisite Biowulf modules:

sinteractive
module load git python/3.7

Clone the GitHub repo and create a work directory for running the toil-vg pedigree workflow:

TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
mkdir -p ${TOIL_VG_DIR} && cd ${TOIL_VG_DIR}
git clone --single-branch --branch vg_pedigree_workflow https://github.com/vgteam/toil-vg.git

Download the workflow inputs and set up the toil-vg virtual environment used to run toil-vg workflows:

WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_toil_vg.sh -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
exit

Input Read Setup Instructions

Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). In the template below, set COHORT_NAME to the sample name of the proband in the UDP cohort, and list the proband, sibling, and parental IDs in the COHORT_NAMES_LIST bash array, separated by spaces.

COHORT_INPUT_DATA should contain the full path to the directory containing all raw read data of the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located in /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then the path for COHORT_INPUT_DATA should be /data/Udpdata/Individuals/.

COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS/"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
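
A quick sanity check on these inputs is to confirm that every ID in COHORT_NAMES_LIST has a matching subdirectory under COHORT_INPUT_DATA. The following is a minimal sketch, assuming the one-directory-per-sample layout from the example above; `missing_sample_dirs` is a hypothetical helper and not part of toil-vg.

```shell
# missing_sample_dirs INPUT_DIR SAMPLE...: print each SAMPLE that has no
# subdirectory of the same name directly under INPUT_DIR.
missing_sample_dirs() {
    local input_dir="${1%/}" sample   # drop one trailing slash, if present
    shift
    for sample in "$@"; do
        [ -d "${input_dir}/${sample}" ] || printf '%s\n' "$sample"
    done
    return 0
}
```

Calling `missing_sample_dirs "${COHORT_INPUT_DATA}" "${COHORT_NAMES_LIST[@]}"` prints nothing when every sample directory is present.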

Running the Workflow

cd into the cohort work directory and set up the input variables. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband first. For example, if the pedigree has one proband UDP_PROBAND and 2 additional siblings UDP_SIB_1 and UDP_SIB_2: SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").

The analysis part of the workflow requires a few additional inputs. The SIBLING_GENDERS bash array must list the proband and sibling genders (0 for male, 1 for female) in the same order as SIBLING_ID_LIST. The SIBLING_AFFECTED bash array must list the proband and sibling affected status (0 for unaffected, 1 for affected), also in the same order as SIBLING_ID_LIST. Note that both arrays use a 0/1 coding, unlike the 1/2 coding of the sex and affected columns in the .ped file. CHROM_ANNOT_DIR, EDIT_ANNOT_DIR, and CADD_DATA_DIR must contain the full paths to the 'Chromosome_Files_hg19', 'MultiEditor', and 'CADD-scripts-master' directories, respectively.

PED_FILE must point to a valid .ped file named COHORT_ID.ped or PROBAND_SAMPLE_ID.ped that follows the tab-delimited PED file format. The .ped file must contain only the mother-father-proband trio. For example, the .ped file for the HG002 trio, where the proband is HG002, the father is HG003, and the mother is HG004, looks like this:

#Family ID	Individual ID	Father	Mother	Sex[1=M]	Affected[2=A]
HG002	HG002	HG003	HG004	1	2
HG002	HG003	0	0	1	1
HG002	HG004	0	0	2	1
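
Because the file must be tab-delimited and contain only the trio, checking it before launching can save a failed run. The following is a minimal sketch using awk; the function name `check_trio_ped` is hypothetical, and treating lines that start with `#` as a header to skip is an assumption based on the example above.

```shell
# check_trio_ped FILE: succeed if FILE has exactly three sample rows and
# every sample row has six tab-separated fields.
check_trio_ped() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }   # skip header/comment lines and blank lines
        NF != 6 {
            printf "line %d: expected 6 tab-separated fields, found %d\n", NR, NF
            bad = 1
        }
        { rows++ }
        END {
            if (rows != 3) {
                printf "expected 3 sample rows (proband, father, mother), found %d\n", rows
                bad = 1
            }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

A file whose columns were pasted with spaces instead of tabs fails this check, which is a common way for a .ped file to go wrong.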

Set up input variables

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
SIBLING_GENDERS=(0 0 0)
SIBLING_AFFECTED=(1 0 0)
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
CHROM_ANNOT_DIR="/PATH/TO/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/PATH/TO/MultiEditor"
CADD_DATA_DIR="/PATH/TO/CADD-scripts-master"
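
Since SIBLING_GENDERS and SIBLING_AFFECTED are matched to SIBLING_ID_LIST by position, a consistency check on the three arrays can catch ordering or length mistakes early. The following is a minimal sketch; `check_sibling_arrays` is a hypothetical helper, not part of toil-vg, and it only checks lengths and the 0/1 coding described above.

```shell
# check_sibling_arrays: succeed if the three global bash arrays have equal
# length and every gender/affected entry is 0 or 1.
check_sibling_arrays() {
    local n=${#SIBLING_ID_LIST[@]} v
    if [ "${#SIBLING_GENDERS[@]}" -ne "$n" ] || [ "${#SIBLING_AFFECTED[@]}" -ne "$n" ]; then
        echo "array lengths differ from SIBLING_ID_LIST" >&2
        return 1
    fi
    for v in "${SIBLING_GENDERS[@]}" "${SIBLING_AFFECTED[@]}"; do
        case "$v" in
            0|1) ;;
            *) echo "unexpected value: $v" >&2; return 1 ;;
        esac
    done
}

# Same placeholder values as the template above.
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
SIBLING_GENDERS=(0 0 0)
SIBLING_AFFECTED=(1 0 0)
check_sibling_arrays && echo "sibling arrays look consistent"
```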

Set up the workflow bash script

${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -i "${SIBLING_GENDERS[*]}" -b "${SIBLING_AFFECTED[*]}" -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
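
The way a bash array is expanded matters when a whole list is passed as one argument, as with -s above. The following minimal sketch shows the difference between the two forms. That the helper script splits the joined string back on spaces is an assumption on my part, illustrated by the last two lines, and the gender values here are made up for the demonstration.

```shell
SIBLING_GENDERS=(0 1 0)

first_only="${SIBLING_GENDERS}"    # a bare array name is the same as ${SIBLING_GENDERS[0]}
joined="${SIBLING_GENDERS[*]}"     # all elements joined by a space (the first character of IFS)

echo "first_only=${first_only}"    # prints: first_only=0
echo "joined=${joined}"            # prints: joined=0 1 0

# One way a receiving script could split the single argument back into an array.
received=(${joined})
echo "received ${#received[@]} values"   # prints: received 3 values
```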

Run the cohort mapping and variant calling workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh

The final output files can be found in the following directory:

${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_outstore

Troubleshooting

Troubleshooting within Toil can unfortunately be tricky. The log file for this example run is located at ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.log. Although it will likely not be the most informative about the real issue, it can act as a starting point for figuring out what went wrong. The general practice I use when looking at Toil log files is to first look at the last line that contains the Python traceback header Traceback (most recent call last):. The traceback can tell you which Toil job function, in which source file, raised the error.

Also, looking for ERROR lines immediately before the Traceback lines in the log should give helpful messages, which are likely to pertain to the software that was running inside the container image when the error occurred.
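
The two steps above can be sketched as a small shell helper. This is only a minimal sketch: the function name `last_traceback_context` and the default window of 20 lines are my own choices, and it assumes the relevant messages contain the literal text ERROR.

```shell
# last_traceback_context LOG [N]: find the last traceback header in LOG, print
# the ERROR lines among the N lines (default 20) just before it, then print
# the traceback header through the end of the log.
last_traceback_context() {
    local log="$1" before="${2:-20}" line start
    # grep -n prefixes matches with their line number; keep only the last one.
    line=$(grep -n -F 'Traceback (most recent call last):' "$log" | tail -n 1 | cut -d: -f1) || true
    if [ -z "$line" ]; then
        echo "no traceback found in $log" >&2
        return 1
    fi
    start=$(( line > before ? line - before : 1 ))
    if [ "$line" -gt 1 ]; then
        # ERROR lines in the window immediately before the traceback.
        sed -n "${start},$(( line - 1 ))p" "$log" | grep 'ERROR' || true
    fi
    # The traceback itself, through the end of the log.
    sed -n "${line},\$p" "$log"
}
```

For example, `last_traceback_context "${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.log" 50` widens the window to 50 lines.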

You can also get more information by switching the logger to Toil's debug mode and rerunning the workflow script. To do this, edit the workflow script ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh and replace the --logInfo option with --logDebug.
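
If you would rather not edit the script by hand, the same substitution can be done with sed. This is a minimal sketch; `enable_debug_logging` is a hypothetical helper name, and it keeps a .bak copy of the script so the change is easy to undo.

```shell
# enable_debug_logging SCRIPT: replace --logInfo with --logDebug in SCRIPT,
# keeping the original as SCRIPT.bak. Fails if --logInfo is not present.
enable_debug_logging() {
    local script="$1"
    if ! grep -q -- '--logInfo' "$script"; then
        echo "--logInfo not found in $script" >&2
        return 1
    fi
    # -i.bak edits in place and is accepted by both GNU and BSD sed.
    sed -i.bak 's/--logInfo/--logDebug/g' "$script"
}
```

For example, `enable_debug_logging "${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh"`.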

Restarting a workflow

To rerun a workflow, you can either rerun the helper script to regenerate the workflow script and then resubmit that script, as follows (NOTE THE -r FLAG):

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
SIBLING_GENDERS=(0 0 0)
SIBLING_AFFECTED=(1 0 0)
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
CHROM_ANNOT_DIR="/PATH/TO/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/PATH/TO/MultiEditor"
CADD_DATA_DIR="/PATH/TO/CADD-scripts-master"

${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -i "${SIBLING_GENDERS[*]}" -b "${SIBLING_AFFECTED[*]}" -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR} -r

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh

OR you can manually rerun it by adding the --restart flag to the toil-vg command in the bash script file ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh and resubmitting that workflow script:

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh