Running a Whole Genome Pedigree Dataset (NIH Biowulf)
cd into a directory with at least 2 TB of allocated disk space:
cd /data/$USER
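If you are unsure whether enough space is available, a quick check with standard tools (Biowulf also has its own quota-reporting utilities) is:
# Report free space on the filesystem backing your data directory.
df -h /data/$USER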
Launch an interactive session on Biowulf and load requisite Biowulf modules:
sinteractive
module load git python/3.7
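The bare sinteractive call above requests a default-sized session; if you want more headroom for the setup steps, sinteractive accepts the usual sbatch-style resource flags. The values below are illustrative only, not requirements of the workflow:
# Optional: request a larger interactive session for the setup steps.
sinteractive --cpus-per-task=4 --mem=16g --gres=lscratch:50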
Clone the GitHub repo and create a work directory for running the toil-vg pedigree workflow:
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
mkdir -p ${TOIL_VG_DIR} && cd ${TOIL_VG_DIR}
git clone --single-branch --branch vg_pedigree_workflow https://github.com/vgteam/toil-vg.git
Download workflow inputs and set up toil-vg virtual environment to run toil-vg workflows:
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_toil_vg.sh -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
exit
Set up the cohort working directory and collect input reads for the cohort (this should take a few minutes). Only COHORT_NAME needs to be changed from this template. COHORT_NAME should be the sample name of the proband in a UDP cohort. The COHORT_NAMES_LIST bash array variable needs to list the proband, sibling, and parental IDs in a space-delimited manner. COHORT_INPUT_DATA should contain the full path to the directory containing all raw read data for the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located in /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then COHORT_INPUT_DATA should be /data/Udpdata/Individuals/.
COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS/"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
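Before running the script, it can be worth sanity-checking that the input layout matches the assumption above, i.e. one subdirectory per sample under COHORT_INPUT_DATA containing that sample's raw reads. The loop below just lists those directories; the sample names are the placeholders from the template:
# List the per-sample read directories the setup script is expected to pull from.
for SAMPLE in "${COHORT_NAMES_LIST[@]}"; do
    echo "--- ${SAMPLE} ---"
    ls "${COHORT_INPUT_DATA}/${SAMPLE}/"
done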
cd into the cohort work directory and set up input variables.
The SIBLING_ID_LIST bash array variable needs to list the proband and sibling IDs in a space-delimited manner, with the proband listed first. For example, if the pedigree has one proband UDP_PROBAND and two additional siblings UDP_SIB_1 and UDP_SIB_2:
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
One of the input variables, PED_FILE, must point to a valid .ped file named according to the COHORT_ID.ped or PROBAND_SAMPLE_ID.ped naming scheme, and the file must follow the tab-delimited PED file format. The .ped file needs to contain only the mother-father-proband trio set of samples. For example, the HG002 trio .ped file looks like the following, where the proband is HG002, the father is HG003, and the mother is HG004:
#Family  ID     Father  Mother  Sex[1=M]  Affected[2=A]
HG002    HG002  HG003   HG004   1         2
HG002    HG003  0       0       1         1
HG002    HG004  0       0       2         1
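If you need to create the trio .ped file by hand, remember that the columns must be separated by tabs. A minimal sketch for the HG002 example above, written with printf so the tabs are explicit (substitute your own family and sample IDs and the file name expected by the workflow):
# Write a tab-delimited trio PED file for the HG002 example.
printf '#Family\tID\tFather\tMother\tSex[1=M]\tAffected[2=A]\n' > HG002.ped
printf 'HG002\tHG002\tHG003\tHG004\t1\t2\n' >> HG002.ped
printf 'HG002\tHG003\t0\t0\t1\t1\n' >> HG002.ped
printf 'HG002\tHG004\t0\t0\t2\t1\n' >> HG002.ped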
Set up input variables
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
Set up the workflow bash script
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
Run the cohort mapping and variant calling workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh
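The sbatch command only submits the job; monitoring it uses the standard Slurm tools, nothing toil-vg specific. The log path below is the one described in the troubleshooting section further down:
# Check the state of your submitted job(s).
squeue -u $USER
# Follow the workflow log as it is written.
tail -f ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.log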
The final output files can be found in the following directory:
${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_outstore
Troubleshooting within Toil can unfortunately be a very tricky task. The log files for this example run are located in ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.log. Though they will likely not point directly at the real issue, they can act as a starting point for figuring out what really went wrong.
The general practice I use when looking at Toil log files is to first find the latest line that contains the Python traceback marker Traceback (most recent call last):. The traceback tells you which Toil job function, and in which source file, the error occurred. Also look for ERROR lines immediately prior to the Traceback lines in the log; these should give helpful messages that are likely to pertain to the software run within the container image where the error occurred.
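A hedged example of pulling those lines out of the log with plain grep (the search strings are just the markers mentioned above):
LOG=${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.log
# Print the last occurrence of a Python traceback, with some trailing context.
grep -n -A 10 'Traceback (most recent call last):' ${LOG} | tail -n 20
# Print the most recent ERROR lines, which usually name the failing containerized tool.
grep -n 'ERROR' ${LOG} | tail -n 20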
You can also get more information by escalating the logger to Toil's debug mode and rerunning the workflow script. To do this, modify the workflow running script ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh to replace the --logInfo flag with --logDebug.
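One way to make that switch without editing the file by hand (assuming the flag appears literally as --logInfo, as generated by the helper script) is an in-place sed:
# Escalate Toil's logging from info to debug level in the generated workflow script.
sed -i 's/--logInfo/--logDebug/' ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh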
To rerun a workflow, you can either rerun the helper script to regenerate the workflow script and then rerun that script as follows (NOTE THE -r FLAG):
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR} -r
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh
OR you can manually rerun it by simply adding the --restart flag to the file ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh and rerunning that workflow script:
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh