
Running a Whole Genome Pedigree Dataset (NIH Biowulf & DeepTrio, Giraffe support)

Charles Markello edited this page Jul 20, 2021 · 9 revisions

Requirements

  • Approximately 6 TB of disk space is needed for temporary files when processing a whole genome quartet family. Improvements to these requirements are in development.
  • The HPC batch scheduler needs to be Slurm for the configuration of these helper scripts to work. Otherwise you will need to modify the --batchSystem argument in the bash scripts created by these helper scripts.
  • Read access to the /data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs directory, where the candidate analysis workflow inputs are located.
  • Read access to the /data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs directory, where the mapping and variant calling workflow inputs (WORKFLOW_INPUT_DIR) are located.
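The disk space requirement can be checked before anything is installed. The following is a minimal sketch, assuming a POSIX `df`; `available_kb` and `has_free_tb` are hypothetical helper names, not part of toil-vg, and on quota-managed filesystems `df` may not reflect your personal allocation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of toil-vg): check free space before a run.

# Print the available space of the filesystem holding DIR, in 1 KiB blocks.
available_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if DIR has at least MIN_TB tebibytes available.
has_free_tb() {
    local dir="$1" min_tb="$2" avail_kb need_kb
    avail_kb=$(available_kb "$dir")
    [ -n "$avail_kb" ] || return 2
    need_kb=$(( min_tb * 1024 * 1024 * 1024 ))
    [ "$avail_kb" -ge "$need_kb" ]
}

# Example: warn if the current directory has less than the ~6 TB noted above.
if ! has_free_tb . 6; then
    echo "warning: less than 6 TB available in $(pwd)" >&2
fi
```

Run it from the directory you intend to use as the working directory; the warning is advisory only and does not stop anything.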

Setup Instructions

To start, launch an interactive session on Biowulf and load requisite Biowulf modules for environment and input data setup:

sinteractive
module load git python/3.7

Set up the main working directory and install toil-vg

cd into a directory with at least 6 TB of allocated disk space

TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
if [ ! -d "${TOIL_VG_DIR}" ]; then
    mkdir -p ${TOIL_VG_DIR}
    chmod 2770 ${TOIL_VG_DIR}
fi
cd ${TOIL_VG_DIR}

Download and install toil-vg software

rm -rf toil-vg
# GRCh37 version (run only one of the two toil-vg clone commands)
git clone --single-branch --branch vg_pedigree_workflow_deepvariant https://github.com/vgteam/toil-vg.git
# GRCh38 version
git clone --single-branch --branch vg_pedigree_workflow_deepvariant_dev https://github.com/vgteam/toil-vg.git
git clone https://github.com/cmarkello/toil.git
python3 -m venv toilvg_venv
source toilvg_venv/bin/activate
pip install ./toil
pip install ./toil-vg
deactivate
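After installation it can be worth confirming that the virtual environment actually contains the toil-vg command. This is a minimal sketch under the assumption that `pip install ./toil-vg` places a `toil-vg` executable in the environment's `bin/` directory; `venv_has_command` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the virtual environment VENV_DIR provides
# an executable named CMD in its bin/ directory.
venv_has_command() {
    local venv_dir="$1" cmd="$2"
    [ -x "${venv_dir}/bin/${cmd}" ]
}

# Default to the directory layout used in this guide if TOIL_VG_DIR is unset.
TOIL_VG_DIR="${TOIL_VG_DIR:-/data/$(id -un)/test_toil_vg_run/toil_vg_tools}"

if venv_has_command "${TOIL_VG_DIR}/toilvg_venv" toil-vg; then
    echo "toil-vg is installed in ${TOIL_VG_DIR}/toilvg_venv"
else
    echo "toil-vg not found; re-run the pip install steps" >&2
fi
```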

Input Read Setup Instructions

Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). In this template, set COHORT_NAME to the sample name of the proband in a UDP cohort, and list the proband, sibling, and parental IDs in the space-delimited COHORT_NAMES_LIST bash array.

COHORT_INPUT_DATA should contain the full path to the directory containing all raw read data of the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located in /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then the path for COHORT_INPUT_DATA should be /data/Udpdata/Individuals/.

COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS/"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
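The command above passes the sample list as "${COHORT_NAMES_LIST[*]}" rather than "${COHORT_NAMES_LIST[@]}". The small sketch below shows the difference in plain bash: with the default IFS, `[*]` joins the array into one space-delimited argument, while `[@]` expands to one argument per element. `count_args` is a hypothetical helper used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND")

# Quoted [*] joins the elements with spaces into a single argument.
count_args "${COHORT_NAMES_LIST[*]}"   # prints 1

# Quoted [@] keeps every element as its own argument.
count_args "${COHORT_NAMES_LIST[@]}"   # prints 3
```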

If the read setup script doesn't work for your data, you can copy or soft-link the reads into the following file structure manually. For the example cohort names given above, the read data should be organized as follows:

${COHORT_WORKFLOW_DIR}/input_reads/UDP_MATERNAL_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_MATERNAL_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PATERNAL_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PATERNAL_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PROBAND_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PROBAND_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_1_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_1_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_2_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_2_read_pair_2.fq.gz
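When the reads are placed by hand, a quick check that every sample has both read files in this layout can catch naming mistakes early. The following is a minimal sketch; `missing_reads` is a hypothetical helper, not part of toil-vg.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every expected read file that is missing under
# WORKDIR/input_reads for the given sample names; fail if any is missing.
# A dangling soft link counts as missing.
missing_reads() {
    local workdir="$1"; shift
    local sample pair path status=0
    for sample in "$@"; do
        for pair in 1 2; do
            path="${workdir}/input_reads/${sample}_read_pair_${pair}.fq.gz"
            if [ ! -e "$path" ]; then
                echo "$path"
                status=1
            fi
        done
    done
    return "$status"
}
```

With the variables from the template above it could be called as missing_reads "${COHORT_WORKFLOW_DIR}" "${COHORT_NAMES_LIST[@]}", which prints nothing and succeeds when all ten files are in place.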

Running the Workflow

cd into the cohort work directory and set up the input variables. COHORT_NAME is the name of the cohort, which is also the sample name of the proband.

For the analysis part of the workflow a few additional inputs are required. The CHROM_ANNOT_DIR bash string needs to contain the full path to the 'Chromosome_Files_hg19' directory ('Chromosome_Files_hs38d1' for GRCh38). The EDIT_ANNOT_DIR bash string needs to contain the full path to the 'MultiEditor' directory. The CADD_DATA_DIR bash string needs to contain the full path to the 'CADD-scripts-master' directory.

COHORT_PED_FILE must point to a valid .ped file named either COHORT_ID.ped or PROBAND_SAMPLE_ID.ped, and the file must follow the tab-delimited PED file format. The .ped file needs to contain all members of the family; the order of the rows doesn't matter. For example, an HG002 cohort .ped file looks like the following, where the proband is HG002, the father is HG003, the mother is HG004, SIB_1 is an unaffected sibling, and SIB_2 is an affected sibling:

#Family  ID     Father  Mother  Sex[1=M]  Affected[2=A]
HG002    HG002  HG003   HG004   1         2
HG002    HG003  0       0       1         1
HG002    HG004  0       0       2         1
HG002    SIB_1  HG003   HG004   2         1
HG002    SIB_2  HG003   HG004   1         2
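Because the .ped file has to be tab-delimited and has to list every family member, a small sanity check before launching the workflow can save a failed run. This is a minimal sketch using awk; `check_ped` is a hypothetical helper, and the rules it enforces (six tab-delimited fields, every non-zero parent ID has its own row) are assumptions drawn from the description above rather than the workflow's own validation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: basic sanity checks on a PED file.
# Fails if a non-comment line does not have exactly 6 tab-delimited fields,
# or if a father/mother ID other than 0 has no row of its own.
check_ped() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }          # skip header comments and blank lines
        NF != 6 {
            printf "line %d: expected 6 tab-delimited fields, found %d\n", NR, NF
            bad = 1
            next
        }
        { seen[$2] = 1; father[NR] = $3; mother[NR] = $4 }
        END {
            for (n in father) {
                if (father[n] != "0" && !(father[n] in seen)) {
                    printf "line %d: father %s has no row\n", n, father[n]; bad = 1
                }
                if (mother[n] != "0" && !(mother[n] in seen)) {
                    printf "line %d: mother %s has no row\n", n, mother[n]; bad = 1
                }
            }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

A file that was typed with spaces instead of tabs fails the field-count check, which is the most common way such a file goes wrong.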

Set up input variables

COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
# GRCh37 version (set only the variables for your genome build)
WORKFLOW_INPUT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs"
ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs"
CHROM_ANNOT_DIR="${ANNOT_DIR}/Chromosome_Files_hg19"
# GRCh38 version
WORKFLOW_INPUT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs/grch38_inputs"
ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38"
CHROM_ANNOT_DIR="${ANNOT_DIR}/Chromosome_Files_hs38d1"
# Both versions
EDIT_ANNOT_DIR="${ANNOT_DIR}/MultiEditor"
CADD_DATA_DIR="${ANNOT_DIR}/CADD-scripts-master"
# GRCh37 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
# GRCh38 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -b true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
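Since the helper script takes several directory paths, a typo in any one of them may only surface later in the run. The following is a minimal sketch of a pre-flight check; `require_dirs` is a hypothetical helper, not part of toil-vg.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every argument that is not a readable
# directory; fail if at least one is missing or unreadable.
require_dirs() {
    local dir status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
            echo "missing or unreadable directory: $dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

It could be called before the setup script, for example as require_dirs "${WORKFLOW_INPUT_DIR}" "${CHROM_ANNOT_DIR}" "${EDIT_ANNOT_DIR}" "${CADD_DATA_DIR}" "${TOIL_VG_DIR}".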

Run the cohort mapping and variant calling workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_pedigree_workflow.sh

The final output files can be found in the following directory:

${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_outstore

Troubleshooting

Troubleshooting within Toil can unfortunately be tricky. The log file for this example run is located at ${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.log. Although it is often not very informative about the real issue, it is a starting point for figuring out what went wrong. The general practice I use when looking at Toil log files is to first find the last line that contains the Python traceback header Traceback (most recent call last):. The traceback tells you which Toil job function, in which source file, raised the error.

ERROR lines immediately before the Traceback lines in the log often give helpful messages as well. These are likely to pertain to the software that was running inside a container image when the error occurred.
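The search for the last traceback and the lines just before it can be scripted. This is a minimal sketch using grep and sed; `last_traceback` is a hypothetical helper, and the default of 20 context lines is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the CONTEXT lines before the last Python
# traceback in LOG, followed by everything from that traceback to the end.
last_traceback() {
    local log="$1" context="${2:-20}" line start
    # Line number of the last traceback header, empty if there is none.
    line=$(grep -n 'Traceback (most recent call last):' "$log" | tail -n 1 | cut -d: -f1) || true
    if [ -z "$line" ]; then
        echo "no traceback found in $log" >&2
        return 1
    fi
    start=$(( line > context ? line - context : 1 ))
    sed -n "${start},\$p" "$log"
}
```

For the example run this could be used as last_traceback "${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.log" 40, optionally piped through grep ERROR to pick out the messages that precede the failure.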

You can also get more information by raising the logger to Toil's debug level and rerunning the workflow script. To do this, edit the workflow script ${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.sh and replace the option --logInfo with --logDebug.
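The option swap can be done with sed instead of a text editor. This is a minimal sketch; `enable_debug_logging` is a hypothetical helper, and it keeps the original script as a .bak copy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: switch a generated workflow script from --logInfo to
# --logDebug, keeping the original as SCRIPT.bak.
enable_debug_logging() {
    local script="$1"
    sed -i.bak 's/--logInfo/--logDebug/g' "$script"
}
```

For the example run: enable_debug_logging "${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.sh".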

Restarting a workflow

To rerun a workflow you can either rerun the helper script to regenerate the workflow script and rerun that script via the following (NOTE THE -r FLAG):

COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
# The paths below are the GRCh37 inputs; for GRCh38, use the GRCh38 paths given earlier for the initial run
WORKFLOW_INPUT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs"
CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/MultiEditor"
CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/CADD-scripts-master"
# GRCh37 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
# GRCh38 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -b true -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_pedigree_workflow.sh

OR you can restart it manually: delete the toil clean command and add the --restart flag to the toil-vg command in the bash script file ${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.sh, then rerun that workflow script:

cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_pedigree_workflow.sh
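The manual edit can also be sketched with sed. The layout of the generated script is an assumption here: the sketch expects the toil clean command on a single line of its own and the toil-vg command to start a line with its subcommand directly after it. `prepare_restart` is a hypothetical helper; inspect the result (the original is kept as a .bak copy) before resubmitting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepare a generated workflow script for a restart by
# deleting the "toil clean" line and adding --restart after the toil-vg
# subcommand. Assumes each command starts its own line; keeps SCRIPT.bak.
prepare_restart() {
    local script="$1"
    sed -i.bak \
        -e '/^[[:space:]]*toil clean/d' \
        -e '/--restart/!s/^\([[:space:]]*toil-vg[[:space:]][[:space:]]*[^[:space:]]*\)/\1 --restart/' \
        "$script"
}
```

A line that already contains --restart is left unchanged, so running the helper twice on a single-line toil-vg command does not add the flag a second time.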

Running the Candidate Analysis Workflow

cd into the cohort work directory and set up the input variables. COHORT_NAME is the name of the cohort, which is also the sample name of the proband.

For the analysis part of the workflow a few additional inputs are required. The CHROM_ANNOT_DIR bash string needs to contain the full path to the 'Chromosome_Files_hg19' directory ('Chromosome_Files_hs38d1' for GRCh38). The EDIT_ANNOT_DIR bash string needs to contain the full path to the 'MultiEditor' directory. The CADD_DATA_DIR bash string needs to contain the full path to the 'CADD-scripts-master' directory.

COHORT_PED_FILE must point to a valid .ped file named either COHORT_ID.ped or PROBAND_SAMPLE_ID.ped, and the file must follow the tab-delimited PED file format. The .ped file needs to contain all members of the family; the order of the rows doesn't matter. For example, an HG002 cohort .ped file looks like the following, where the proband is HG002, the father is HG003, the mother is HG004, SIB_1 is an unaffected sibling, and SIB_2 is an affected sibling:

#Family  ID     Father  Mother  Sex[1=M]  Affected[2=A]
HG002    HG002  HG003   HG004   1         2
HG002    HG003  0       0       1         1
HG002    HG004  0       0       2         1
HG002    SIB_1  HG003   HG004   2         1
HG002    SIB_2  HG003   HG004   1         2

Set up input variables

COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
# GRCh37 version (set only the variables for your genome build)
CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/MultiEditor"
CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/CADD-scripts-master"
# GRCh38 version
CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/Chromosome_Files_hs38d1"
EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/MultiEditor"
CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/CADD-scripts-master"

Set up the workflow bash script

# GRCh37 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR}
# GRCh38 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -b true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR}

Run the candidate analysis workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_analysis_workflow.sh

The final output files can be found in the following directory:

${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_analysis_outstore

Restarting the analysis workflow

To rerun a workflow you can either rerun the helper script to regenerate the workflow script and rerun that script via the following (NOTE THE -r FLAG):

COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
# GRCh37 version (set only the variables for your genome build)
CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/MultiEditor"
CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/CADD-scripts-master"
# GRCh38 version
CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/Chromosome_Files_hs38d1"
EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/MultiEditor"
CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/CADD-scripts-master"

Set up the workflow bash script

# GRCh37 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR}
# GRCh38 version
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -b true -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR}
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_analysis_workflow.sh

OR you can restart it manually: delete the toil clean command and add the --restart flag to the toil-vg command in the bash script file ${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_analysis_workflow.sh, then rerun that workflow script:

cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_analysis_workflow.sh