Running a Whole Genome Pedigree Dataset (NIH Biowulf & DeepTrio, Giraffe support)
- Approximately 6 TB of disk space is needed for temporary files when computing a whole genome quartet family. Improvements to these requirements are in development.
- The HPC batch scheduler needs to be Slurm-based for the configuration of these helper scripts to work. Otherwise you will need to modify the `--batchSystem` argument in the bash scripts created by these helper scripts.
- Read access to the `/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs` directory, where the candidate analysis workflow inputs are located.
- Read access to the `/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs` directory, where the toil-vg workflow inputs are located.
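These prerequisites can be checked before starting. The following is a minimal sketch and not part of toil-vg: the helper names `check_readable_dir` and `free_gb`, and the `REQUIRED_GB` threshold (6 TB written as 6000 GB), are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Pre-flight sketch. The helper names below are hypothetical, not part of toil-vg.

# Succeed if the argument is an existing directory we can read and enter.
check_readable_dir() {
    [ -d "$1" ] && [ -r "$1" ] && [ -x "$1" ]
}

# Print the free space, in whole GB, of the filesystem that holds the given path.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

REQUIRED_GB=6000  # assumption: "approximately 6 TB" expressed in GB
INPUT_ROOT="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs"

for dir in "${INPUT_ROOT}/pedigree_analysis_inputs" "${INPUT_ROOT}/toil_vg_inputs"; do
    if check_readable_dir "${dir}"; then
        echo "OK: ${dir} is readable"
    else
        echo "WARNING: cannot read ${dir}"
    fi
done

if [ "$(free_gb .)" -lt "${REQUIRED_GB}" ]; then
    echo "WARNING: less than ${REQUIRED_GB} GB free in $(pwd)"
fi

# The helper scripts assume Slurm, so sbatch should be on the PATH.
command -v sbatch >/dev/null 2>&1 || echo "WARNING: sbatch not found; is this a Slurm system?"
```

Note that `df` reports free space on the filesystem as a whole. If your storage is governed by a per-user or per-group quota, check that quota separately.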
To start, launch an interactive session on Biowulf and load requisite Biowulf modules for environment and input data setup:
sinteractive
module load git python/3.7
Change into a directory with at least 6 TB of allocated disk space:
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
if [ ! -d "${TOIL_VG_DIR}" ]; then
    mkdir -p "${TOIL_VG_DIR}"
    chmod 2770 "${TOIL_VG_DIR}"
fi
cd "${TOIL_VG_DIR}"
Download and install toil-vg software
rm -rf toil-vg
GRCH37 version | git clone --single-branch --branch vg_pedigree_workflow_deepvariant https://github.com/vgteam/toil-vg.git |
---|---|
GRCH38 version | git clone --single-branch --branch vg_pedigree_workflow_deepvariant_dev https://github.com/vgteam/toil-vg.git |
git clone https://github.com/cmarkello/toil.git
python3 -m venv toilvg_venv
source toilvg_venv/bin/activate
pip install ./toil
pip install ./toil-vg
deactivate
Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). Only the cohort-specific values need to change from this template. `COHORT_NAME` should be the sample name of the proband in a UDP cohort. The `COHORT_NAMES_LIST` bash array variable needs to list the proband, sibling, and parental IDs, separated by spaces. `COHORT_INPUT_DATA` should contain the full path to the directory containing all raw read data of the cohort. For example, if the raw reads for `PROBAND` and `SIBLING_1` are located in `/data/Udpdata/Individuals/PROBAND/R1_fastq.gz` and `/data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz` respectively, then the path for `COHORT_INPUT_DATA` should be `/data/Udpdata/Individuals/`.
COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS/"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
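The `-l` argument receives the whole array as a single string. As a small sketch of why `[*]` is used for that: inside double quotes, `${COHORT_NAMES_LIST[*]}` joins the elements with the first character of `IFS` (a space by default), whereas `[@]` expands to one argument per element. The `count_args` helper is hypothetical and exists only to show the difference.

```shell
#!/usr/bin/env bash
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")

# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

count_args "${COHORT_NAMES_LIST[*]}"   # prints 1: one space-delimited string
count_args "${COHORT_NAMES_LIST[@]}"   # prints 5: one argument per sample

# The joined form is what a single option value sees.
JOINED="${COHORT_NAMES_LIST[*]}"
echo "${JOINED}"
```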
If the read setup script doesn't work for your data, you can manually copy or soft-link the reads into the following file structure. For the example cohort names given above, the read data should be organized as follows:
${COHORT_WORKFLOW_DIR}/input_reads/UDP_MATERNAL_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_MATERNAL_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PATERNAL_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PATERNAL_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PROBAND_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PROBAND_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_1_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_1_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_2_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_2_read_pair_2.fq.gz
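One way to build this layout by hand is with soft links. The sketch below is not part of toil-vg: the helper names `link_sample_reads` and `check_cohort_reads` are hypothetical, and it assumes you already know the absolute paths of each sample's two FASTQ files.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of toil-vg.

# Soft-link one sample's paired FASTQ files into <workdir>/input_reads using
# the <SAMPLE>_read_pair_{1,2}.fq.gz naming. Pass absolute source paths so the
# links resolve from inside the input_reads directory.
link_sample_reads() {
    local sample="$1" reads_1="$2" reads_2="$3" workdir="$4"
    mkdir -p "${workdir}/input_reads"
    ln -sf "${reads_1}" "${workdir}/input_reads/${sample}_read_pair_1.fq.gz"
    ln -sf "${reads_2}" "${workdir}/input_reads/${sample}_read_pair_2.fq.gz"
}

# Report every expected read file that is missing (or is a dangling link);
# return non-zero if any is.
check_cohort_reads() {
    local workdir="$1"; shift
    local sample pair missing=0
    for sample in "$@"; do
        for pair in 1 2; do
            if [ ! -e "${workdir}/input_reads/${sample}_read_pair_${pair}.fq.gz" ]; then
                echo "MISSING: ${sample}_read_pair_${pair}.fq.gz"
                missing=1
            fi
        done
    done
    return "${missing}"
}
```

After linking each sample, a call such as `check_cohort_reads "${COHORT_WORKFLOW_DIR}" "${COHORT_NAMES_LIST[@]}"` lists anything still missing before the workflow is submitted.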
Change into the cohort work directory and set up the input variables.
`COHORT_NAME` is the name of the cohort and of the PROBAND sample.
For the analysis part of the workflow, a few additional inputs are required:
- `CHROM_ANNOT_DIR` needs to contain the full path to the 'Chromosome_Files_hg19' directory ('Chromosome_Files_hs38d1' for GRCh38).
- `EDIT_ANNOT_DIR` needs to contain the full path to the 'MultiEditor' directory.
- `CADD_DATA_DIR` needs to contain the full path to the 'CADD-scripts-master' directory.
`COHORT_PED_FILE` must point to a valid `.ped` file that is named following the `COHORT_ID.ped` or `PROBAND_SAMPLE_ID.ped` scheme and that follows the tab-delimited PED file format. The `.ped` file needs to contain all members of the family; their order doesn't matter. For example, an HG002 cohort `.ped` file looks like the following, where the proband is `HG002`, the father is `HG003`, the mother is `HG004`, `SIB_1` is an unaffected sibling, and `SIB_2` is an affected sibling:
#Family ID Father Mother Sex[1=M] Affected[2=A]
HG002 HG002 HG003 HG004 1 2
HG002 HG003 0 0 1 1
HG002 HG004 0 0 2 1
HG002 SIB_1 HG003 HG004 2 1
HG002 SIB_2 HG003 HG004 1 2
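Because the file must be tab-delimited, it can be safer to generate it than to type it. The following sketch writes the HG002 example with `printf` and checks that every non-comment row has six tab-separated fields. The function names and the output path are assumptions for illustration, and the helper scripts may apply further checks of their own.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Write one PED row: family, individual, father, mother, sex, affected status.
ped_row() {
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' "$@"
}

# Succeed only if every non-comment, non-empty line has exactly 6 tab-separated fields.
validate_ped() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        !/^#/ && NF > 0 && NF != 6 { bad = 1; print "line " NR ": " NF " fields" }
        END { exit bad }
    ' "$1"
}

PED_FILE="${TMPDIR:-/tmp}/HG002.ped"   # assumption: example output path
{
    printf '#Family\tID\tFather\tMother\tSex[1=M]\tAffected[2=A]\n'
    ped_row HG002 HG002 HG003 HG004 1 2
    ped_row HG002 HG003 0 0 1 1
    ped_row HG002 HG004 0 0 2 1
    ped_row HG002 SIB_1 HG003 HG004 2 1
    ped_row HG002 SIB_2 HG003 HG004 1 2
} > "${PED_FILE}"

validate_ped "${PED_FILE}" && echo "PED file looks well-formed"
```

A row that was typed with spaces instead of tabs is reported as having one field, which is the most common way a hand-edited `.ped` file goes wrong.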
Set up input variables
COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
GRCH37 version | WORKFLOW_INPUT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs" |
---|---|
GRCH38 version | WORKFLOW_INPUT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs/grch38_inputs" |
GRCH37 version | ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs" |
---|---|
GRCH38 version | ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38" |
GRCH37 version | CHROM_ANNOT_DIR="${ANNOT_DIR}/Chromosome_Files_hg19" |
---|---|
GRCH38 version | CHROM_ANNOT_DIR="${ANNOT_DIR}/Chromosome_Files_hs38d1" |
EDIT_ANNOT_DIR="${ANNOT_DIR}/MultiEditor"
CADD_DATA_DIR="${ANNOT_DIR}/CADD-scripts-master"
GRCH37 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR} |
---|---|
GRCH38 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -b true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR} |
Run the cohort mapping and variant calling workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_pedigree_workflow.sh
The final output files can be found in the following directory:
${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_outstore
Troubleshooting within Toil can unfortunately be tricky. The log file for this example run is located at `${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.log`. Although it will likely not point directly at the real issue, it can act as a starting point for figuring out what went wrong.
The general practice I use when looking at Toil log files is to first look at the last line that contains the Python traceback header `Traceback (most recent call last):`. The traceback can tell you in which Toil job function, and in which source file, the error occurred.
Looking for `ERROR` lines immediately prior to the `Traceback` lines in the log should also give helpful messages, which likely pertain to the software that was running inside a container image when the error occurred.
You can also get more information by switching the logger to Toil's debug level and rerunning the workflow script. To do this, modify the workflow script `${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.sh`, replacing `--logInfo` with `--logDebug`.
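This log triage can be sketched with standard tools. The helper names below are hypothetical and not part of Toil or toil-vg, and the `sed` edit assumes that `--logInfo` appears literally in the generated workflow script.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of Toil or toil-vg.

# Print the line number of the last Python traceback header in a log file.
last_traceback_line() {
    grep -n -F 'Traceback (most recent call last):' "$1" | tail -n 1 | cut -d: -f1
}

# Print up to N (default 5) ERROR lines that occur before the last traceback.
errors_before_last_traceback() {
    local log="$1" count="${2:-5}" line
    line="$(last_traceback_line "${log}")"
    [ -n "${line}" ] || { echo "no traceback found in ${log}" >&2; return 1; }
    head -n "${line}" "${log}" | grep -F 'ERROR' | tail -n "${count}"
}

# Switch a workflow script from --logInfo to --logDebug, keeping a .bak copy.
enable_debug_logging() {
    sed -i.bak 's/--logInfo/--logDebug/' "$1"
}
```

For example, `errors_before_last_traceback "${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.log"` prints the last few `ERROR` lines leading up to the final traceback, which is usually the quickest place to start reading.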
To rerun a workflow, you can either rerun the helper script to regenerate the workflow script and then rerun that script as follows (NOTE THE `-r` FLAG):
COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs"
CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/MultiEditor"
CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/CADD-scripts-master"
GRCH37 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR} |
---|---|
GRCH38 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -b true -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR} |
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_pedigree_workflow.sh
Or you can manually rerun it by deleting the `toil clean` command and adding the `--restart` flag to the `toil-vg` command within the bash script file `${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.sh`, and then rerunning that workflow script:
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_pedigree_workflow.sh
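Before editing the generated script by hand, it can help to see exactly where the relevant commands are. The sketch below only locates lines and changes nothing; the helper name is hypothetical, and how the generated script lays out its `toil clean` and `toil-vg` commands is not assumed beyond those strings appearing in it.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name is hypothetical.

# Show, with line numbers, where `toil clean`, `toil-vg`, and any existing
# `--restart` flag appear in a generated workflow script.
show_restart_edit_points() {
    grep -n -E 'toil clean|toil-vg|--restart' "$1"
}
```

Running `show_restart_edit_points "${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_pedigree_workflow.sh"` before and after the manual edit gives a quick check that the `toil clean` line is gone and that `--restart` is present.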
Change into the cohort work directory and set up the input variables.
`COHORT_NAME` is the name of the cohort and of the PROBAND sample.
For the analysis part of the workflow, a few additional inputs are required:
- `CHROM_ANNOT_DIR` needs to contain the full path to the 'Chromosome_Files_hg19' directory ('Chromosome_Files_hs38d1' for GRCh38).
- `EDIT_ANNOT_DIR` needs to contain the full path to the 'MultiEditor' directory.
- `CADD_DATA_DIR` needs to contain the full path to the 'CADD-scripts-master' directory.
`COHORT_PED_FILE` must point to a valid `.ped` file that is named following the `COHORT_ID.ped` or `PROBAND_SAMPLE_ID.ped` scheme and that follows the tab-delimited PED file format. The `.ped` file needs to contain all members of the family; their order doesn't matter. For example, an HG002 cohort `.ped` file looks like the following, where the proband is `HG002`, the father is `HG003`, the mother is `HG004`, `SIB_1` is an unaffected sibling, and `SIB_2` is an affected sibling:
#Family ID Father Mother Sex[1=M] Affected[2=A]
HG002 HG002 HG003 HG004 1 2
HG002 HG003 0 0 1 1
HG002 HG004 0 0 2 1
HG002 SIB_1 HG003 HG004 2 1
HG002 SIB_2 HG003 HG004 1 2
Set up input variables
COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
GRCH37 version | CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/Chromosome_Files_hg19" EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/MultiEditor" CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/CADD-scripts-master" |
---|---|
GRCH38 version | CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/Chromosome_Files_hs38d1" EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/MultiEditor" CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/CADD-scripts-master" |
Set up the workflow bash script
GRCH37 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR} |
---|---|
GRCH38 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -b true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR} |
Run the cohort analysis workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_analysis_workflow.sh
The final output files can be found in the following directory:
${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_analysis_outstore
To rerun a workflow, you can either rerun the helper script to regenerate the workflow script and then rerun that script as follows (NOTE THE `-r` FLAG):
COHORT_NAME="UDP****"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_PED_FILE="${COHORT_WORKFLOW_DIR}/${COHORT_NAME}.ped"
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
GRCH37 version | CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/Chromosome_Files_hg19" EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/MultiEditor" CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/CADD-scripts-master" |
---|---|
GRCH38 version | CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/Chromosome_Files_hs38d1" EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/MultiEditor" CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs_grch38/CADD-scripts-master" |
Set up the workflow bash script
GRCH37 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR} |
---|---|
GRCH38 version | ${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_analysis_script.sh -b true -r true -f ${COHORT_NAME} -c ${COHORT_PED_FILE} -w ${COHORT_WORKFLOW_DIR} -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -v ${TOIL_VG_DIR} |
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_analysis_workflow.sh
Or you can manually rerun it by deleting the `toil clean` command and adding the `--restart` flag to the `toil-vg` command within the bash script file `${COHORT_WORKFLOW_DIR}/${COHORT_NAME}_analysis_workflow.sh`, and then rerunning that workflow script:
cd ${COHORT_WORKFLOW_DIR}
sbatch --no-requeue --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${COHORT_NAME}_analysis_workflow.sh