Running a Whole Genome Pedigree Dataset (NIH Biowulf)
To start, launch an interactive session on Biowulf and load requisite Biowulf modules for environment and input data setup:
sinteractive
module load git python/3.7
Change into a directory with at least 2 TB of allocated disk space:
cd /data/$USER
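Before cloning anything, it can help to confirm that the target directory really has enough room. The following is a minimal sketch and not part of toil-vg: has_free_space_gb is a hypothetical helper name, and it relies on df, which reports free space on the underlying filesystem and may not reflect your allocated quota on Biowulf.

```shell
# Hypothetical helper (not part of toil-vg): succeed only if the filesystem
# holding directory $1 has at least $2 GB available.
has_free_space_gb() {
    local dir="$1" required_gb="$2" avail_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Example: warn if /data/$USER has less than 2 TB (2048 GB) free.
# has_free_space_gb "/data/$USER" 2048 || echo "Less than 2 TB free in /data/$USER" >&2
```

Treat a passing check as a hint only; the cluster's own quota report remains the authority on how much space you are allocated.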
Clone the GitHub repo and create a work directory for running the toil-vg pedigree workflow:
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
mkdir -p ${TOIL_VG_DIR} && cd ${TOIL_VG_DIR}
git clone --single-branch --branch vg_pedigree_workflow_dev https://github.com/vgteam/toil-vg.git
Download the workflow inputs and set up the toil-vg virtual environment used to run toil-vg workflows:
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_toil_vg.sh -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). In this template you only need to change COHORT_NAME, COHORT_NAMES_LIST and COHORT_INPUT_DATA. COHORT_NAME should be the sample name of the proband in a UDP cohort. The COHORT_NAMES_LIST bash array variable needs to list the proband, sibling and parental IDs in a space-delimited manner. COHORT_INPUT_DATA should contain the full path to the directory containing all raw read data of the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located in /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then COHORT_INPUT_DATA should be /data/Udpdata/Individuals/.
COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS/"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
If the read setup script doesn't work for your data, you can manually copy or soft-link the reads into the following file structure. For the example cohort names given above, the read data should be organized as follows:
${COHORT_WORKFLOW_DIR}/input_reads/UDP_MATERNAL_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_MATERNAL_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PATERNAL_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PATERNAL_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PROBAND_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_PROBAND_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_1_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_1_read_pair_2.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_2_read_pair_1.fq.gz
${COHORT_WORKFLOW_DIR}/input_reads/UDP_SIB_2_read_pair_2.fq.gz
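If you go the manual route, the soft-linking can be scripted. The sketch below is an illustration only: link_cohort_reads is a hypothetical helper, and it assumes each sample directory holds its reads as R1_fastq.gz and R2_fastq.gz (the R2 name is an assumption; only R1_fastq.gz appears in the example above). Adjust the source file names to match your data.

```shell
# Hypothetical helper (not part of toil-vg).
# Usage: link_cohort_reads SRC_DIR DEST_DIR SAMPLE [SAMPLE ...]
# SRC_DIR should be an absolute path so the symlinks resolve from anywhere.
link_cohort_reads() {
    local src_dir="$1" dest_dir="$2" sample mate src
    shift 2
    mkdir -p "$dest_dir" || return 1
    for sample in "$@"; do
        for mate in 1 2; do
            # ASSUMPTION: raw reads are named R1_fastq.gz / R2_fastq.gz
            # inside a directory named after the sample.
            src="${src_dir}/${sample}/R${mate}_fastq.gz"
            if [ ! -e "$src" ]; then
                echo "Missing read file: $src" >&2
                return 1
            fi
            # Link under the file name the workflow expects.
            ln -sf "$src" "${dest_dir}/${sample}_read_pair_${mate}.fq.gz"
        done
    done
}

# Example:
# link_cohort_reads "${COHORT_INPUT_DATA%/}" "${COHORT_WORKFLOW_DIR}/input_reads" "${COHORT_NAMES_LIST[@]}"
```

The helper stops at the first missing file rather than leaving a dangling link, so a partially populated input_reads directory is easier to spot.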
Change into the cohort work directory and set up the input variables.
The SIBLING_ID_LIST bash array variable needs to list the proband and sibling IDs in a space-delimited manner. The proband must be listed first. For example, if the pedigree has one proband UDP_PROBAND and two additional siblings UDP_SIB_1 and UDP_SIB_2:
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
For the analysis part of the workflow, a few additional inputs are required:
The SIBLING_GENDERS bash array needs to list the proband and sibling genders (0 for male, 1 for female). It must follow the same order as SIBLING_ID_LIST.
The SIBLING_AFFECTED bash array needs to list the proband and sibling affected status (0 for unaffected, 1 for affected). It must follow the same order as SIBLING_ID_LIST.
The CHROM_ANNOT_DIR bash string needs to contain the full path to the 'Chromosome_Files_hg19' directory.
The EDIT_ANNOT_DIR bash string needs to contain the full path to the 'MultiEditor' directory.
The CADD_DATA_DIR bash string needs to contain the full path to the 'CADD-scripts-master' directory.
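Because the three sibling arrays are matched by position, a mismatch in length or an out-of-range value is easy to miss. A minimal sanity check might look like the following; check_sibling_arrays is a hypothetical helper that is not part of toil-vg, and it reads the SIBLING_ID_LIST, SIBLING_GENDERS and SIBLING_AFFECTED arrays from the current shell.

```shell
# Hypothetical helper (not part of toil-vg): verify that SIBLING_GENDERS and
# SIBLING_AFFECTED line up with SIBLING_ID_LIST and contain only 0 or 1.
check_sibling_arrays() {
    local n=${#SIBLING_ID_LIST[@]} v
    if [ "${#SIBLING_GENDERS[@]}" -ne "$n" ] || [ "${#SIBLING_AFFECTED[@]}" -ne "$n" ]; then
        echo "SIBLING_GENDERS and SIBLING_AFFECTED must each have $n entries" >&2
        return 1
    fi
    for v in "${SIBLING_GENDERS[@]}" "${SIBLING_AFFECTED[@]}"; do
        case "$v" in
            0|1) ;;
            *) echo "Invalid value '$v': expected 0 or 1" >&2; return 1 ;;
        esac
    done
}

# Example: run after defining the arrays and before generating the workflow script.
# check_sibling_arrays || echo "Fix the sibling arrays before continuing" >&2
```

This only checks shape and value range; it cannot tell whether the order of the entries actually matches your samples.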
The PED_FILE input variable must point to a valid .ped file named according to the COHORT_ID.ped or PROBAND_SAMPLE_ID.ped scheme, and the file must follow the tab-delimited PED file format. The .ped file should contain only the mother-father-proband trio. For example, the HG002 trio .ped file looks like the following, where the proband is HG002, the father is HG003 and the mother is HG004:
#Family ID Father Mother Sex[1=M] Affected[2=A]
HG002 HG002 HG003 HG004 1 2
HG002 HG003 0 0 1 1
HG002 HG004 0 0 2 1
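Since the file must be tab-delimited, writing it with printf avoids editors silently turning tabs into spaces. The sketch below mirrors the HG002 example; write_trio_ped is a hypothetical helper that is not part of toil-vg, and it assumes the family ID equals the proband ID and that both parents are unaffected, as in the example. Note that the PED columns use 1/2 coding (sex: 1 = male, 2 = female; affected: 2 = affected, 1 = unaffected), which differs from the 0/1 coding of SIBLING_GENDERS and SIBLING_AFFECTED.

```shell
# Hypothetical helper (not part of toil-vg).
# Usage: write_trio_ped PED_FILE PROBAND FATHER MOTHER PROBAND_SEX PROBAND_AFFECTED
# ASSUMPTIONS: family ID = proband ID; both parents are unaffected (1).
write_trio_ped() {
    local ped="$1" proband="$2" father="$3" mother="$4" sex="$5" affected="$6"
    {
        printf '#Family\tID\tFather\tMother\tSex[1=M]\tAffected[2=A]\n'
        # Proband row: points at both parents.
        printf '%s\t%s\t%s\t%s\t%s\t%s\n' "$proband" "$proband" "$father" "$mother" "$sex" "$affected"
        # Founder rows: parents of founders are recorded as 0.
        printf '%s\t%s\t0\t0\t1\t1\n' "$proband" "$father"
        printf '%s\t%s\t0\t0\t2\t1\n' "$proband" "$mother"
    } > "$ped"
}

# Example (male, affected proband):
# write_trio_ped "${PED_FILE}" "${SIBLING_ID_LIST[0]}" "${PATERNAL_SAMPLE_NAME}" "${MATERNAL_SAMPLE_NAME}" 1 2
```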
Set up the input variables:
NOTE (FOR INTRAMURAL USERS): the following alternative workflow inputs can be used:
WORKFLOW_INPUT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/toil_vg_inputs"
CHROM_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/MultiEditor"
CADD_DATA_DIR="/data/Udpbinfo/Scratch/markellocj/toil_vg_workflow_inputs/pedigree_analysis_inputs/CADD-scripts-master"
MATERNAL_SAMPLE_NAME="UDP_MATERNAL"
PATERNAL_SAMPLE_NAME="UDP_PATERNAL"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
SIBLING_GENDERS=(0 0 0)
SIBLING_AFFECTED=(1 0 0)
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
CHROM_ANNOT_DIR="/PATH/TO/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/PATH/TO/MultiEditor"
CADD_DATA_DIR="/PATH/TO/CADD-scripts-master"
Set up the workflow bash script:
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -i "${SIBLING_GENDERS[*]}" -b "${SIBLING_AFFECTED[*]}" -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
Run the cohort mapping and variant calling workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh
The final output files can be found in the following directory:
${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_outstore
Troubleshooting within Toil can unfortunately be tricky. The log file for this example run is located at ${COHORT_WORKFLOW_DIR}/${PROBAND_SAMPLE_NAME}_pedigree_workflow.log, where PROBAND_SAMPLE_NAME is the proband ID (${SIBLING_ID_LIST[0]} in the examples above). Although the log will likely not pinpoint the real issue, it is a useful starting point for figuring out what went wrong.
My general practice when reading Toil log files is to first find the last line that contains the Python traceback header Traceback (most recent call last):. The traceback tells you which Toil job function, in which source file, raised the error.
ERROR lines immediately before the Traceback line are also worth checking. They often carry helpful messages from the software that was running inside the container image where the error occurred.
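The log-reading routine described above can be sketched with standard tools. show_last_traceback is a hypothetical helper, not part of Toil or toil-vg; the 20-line window it searches for ERROR lines is an arbitrary default chosen for illustration.

```shell
# Hypothetical helper (not part of Toil or toil-vg).
# Usage: show_last_traceback LOG_FILE [CONTEXT_LINES]
# Prints ERROR lines found in the CONTEXT_LINES lines before the last Python
# traceback, followed by that traceback through the end of the log.
show_last_traceback() {
    local log="$1" context="${2:-20}" last start
    # Line number of the last traceback header in the log.
    last=$(grep -n 'Traceback (most recent call last):' "$log" | tail -n 1 | cut -d: -f1)
    if [ -z "$last" ]; then
        echo "No Python traceback found in $log" >&2
        return 1
    fi
    if [ "$last" -gt 1 ]; then
        start=$(( last > context ? last - context : 1 ))
        # ERROR lines in the window just before the traceback (none is fine).
        sed -n "${start},$((last - 1))p" "$log" | grep 'ERROR' || true
    fi
    # The traceback itself, through the end of the log.
    sed -n "${last},\$p" "$log"
}

# Example:
# show_last_traceback "${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.log"
```

If the window turns up no ERROR lines, try a larger second argument before concluding that the log has nothing more to say.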
You can also get more information by raising the log level to Toil's debug logging and rerunning the workflow script. To do this, edit the workflow script ${COHORT_WORKFLOW_DIR}/${PROBAND_SAMPLE_NAME}_pedigree_workflow.sh and replace --logInfo with --logDebug.
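If you prefer not to edit the script by hand, the flag swap can be done with sed. This is a minimal sketch: enable_debug_logging is a hypothetical helper, and it assumes the generated workflow script contains the literal flag --logInfo.

```shell
# Hypothetical helper (not part of toil-vg).
# Usage: enable_debug_logging WORKFLOW_SCRIPT
# Saves a copy as WORKFLOW_SCRIPT.bak, then replaces --logInfo with --logDebug.
enable_debug_logging() {
    local script="$1"
    [ -f "$script" ] || { echo "No such file: $script" >&2; return 1; }
    cp "$script" "${script}.bak" || return 1
    # Writing back into the existing file keeps its permissions intact.
    sed 's/--logInfo/--logDebug/g' "${script}.bak" > "$script"
}

# Example:
# enable_debug_logging "${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh"
```

Restoring the .bak copy returns the script to its original log level once you are done debugging.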
To rerun a workflow, you can either rerun the helper script to regenerate the workflow script and then resubmit it as follows (NOTE THE -r FLAG):
MATERNAL_SAMPLE_NAME="UDP_MATERNAL"
PATERNAL_SAMPLE_NAME="UDP_PATERNAL"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
SIBLING_GENDERS=(0 0 0)
SIBLING_AFFECTED=(1 0 0)
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_toil_vg_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_toil_vg_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
CHROM_ANNOT_DIR="/PATH/TO/Chromosome_Files_hg19"
EDIT_ANNOT_DIR="/PATH/TO/MultiEditor"
CADD_DATA_DIR="/PATH/TO/CADD-scripts-master"
${TOIL_VG_DIR}/toil-vg/scripts/vg_pedigree_scripts/setup_pedigree_script.sh -r true -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -i "${SIBLING_GENDERS[*]}" -b "${SIBLING_AFFECTED[*]}" -a ${CHROM_ANNOT_DIR} -e ${EDIT_ANNOT_DIR} -d ${CADD_DATA_DIR} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${TOIL_VG_DIR}
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh
Or you can rerun it manually by adding the --restart flag to the toil-vg command within the bash script ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_pedigree_workflow.sh and resubmitting that script:
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh
sinteractive
module load git python/3.7
TOIL_VG_DIR="/data/$USER/test_toil_vg_run/toil_vg_tools"
if [ ! -d "${TOIL_VG_DIR}" ]; then
    mkdir -p ${TOIL_VG_DIR}
    chmod 2770 ${TOIL_VG_DIR}
fi
cd ${TOIL_VG_DIR}
git clone --single-branch --branch vg_pedigree_workflow_dev https://github.com/vgteam/toil-vg.git
git clone https://github.com/cmarkello/toil.git
python3 -m venv toilvg_venv
source toilvg_venv/bin/activate
pip install ./toil
pip install ./toil-vg
deactivate