Filtering with Calling via SNVCurate

Victor Mao edited this page May 14, 2020 · 4 revisions

Post-Processing tutorial (under SNVCurate/postProcessing/):

This tutorial covers post-processing of the calls made by the scripts in SNVCurate. Note that Filter.sh and Annotate.sh submit SLURM batch jobs; wait for those jobs to finish before moving on to the next step. To change the parameters of the SLURM batch jobs (e.g., time, queue, node count), edit the headers of the scripts runRenaming.sh, postProcessing/runAnnotate.sh, and postProcessing/RunFilter.sh.
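One way to wait for the submitted jobs is to poll the queue until nothing with a given job name remains. The sketch below is only an illustration: the function name wait_for_slurm_jobs and the job name passed to it are hypothetical, and the real job name is whatever the SLURM header of the relevant script sets.

```shell
# Hypothetical helper (not part of SNVCurate): block until no job with the
# given name is left in the current user's SLURM queue.
wait_for_slurm_jobs() {
    job_name=$1              # assumed to match the job name in the script header
    poll_seconds=${2:-60}    # seconds to sleep between checks
    # "squeue -h" prints no header, so empty output means no matching jobs remain.
    while [ -n "$(squeue -h -u "$(id -un)" -n "$job_name")" ]; do
        sleep "$poll_seconds"
    done
}
```

For example, `wait_for_slurm_jobs my_filter_job 120` checks every two minutes, where my_filter_job is a placeholder. Filtering by job name keeps the interactive session itself from being counted as a pending job.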

1. Start an interactive session with more than 2G of memory:

srun --pty -t 0-2:0:0 --mem 5G -p interactive /bin/bash

2. Run postProcessing/Intersect.sh.

sh postProcessing/Intersect.sh [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY] [MUSE_CALLS_DIRECTORY]

If you do not have MuSE calls, omit the last field:

sh postProcessing/Intersect.sh [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY]
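The two forms above can be folded into one small wrapper that appends the MuSE directory only when it is given. This is a sketch under assumptions: run_intersect and the DRY_RUN variable are hypothetical names, and by default the function only prints the command instead of running it.

```shell
# Hypothetical wrapper: build the Intersect.sh command, adding the MuSE
# directory only when a third argument is supplied.
run_intersect() {
    out_dir=$1
    mutect_dir=$2
    muse_dir=${3:-}          # optional; empty means no MuSE calls
    set -- sh postProcessing/Intersect.sh "$out_dir" "$mutect_dir"
    if [ -n "$muse_dir" ]; then
        set -- "$@" "$muse_dir"
    fi
    echo "$*"                # show the command that would be run
    # DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually execute.
    if [ "${DRY_RUN:-1}" != "1" ]; then
        "$@"
    fi
}
```

Calling `run_intersect /path/to/output /path/to/mutect` prints the two-directory form, and adding a third path prints the three-directory form.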

3. Run postProcessing/Filter.sh. See the information about relevant scripts in the ReadMe for details on specific fields. Panel filtering is currently only possible with hg19. If you have a matched normal:

sh postProcessing/Filter.sh [FILTERING_OUTPUT_DIRECTORY] [HAPLOTYPECALLER_CALLS_DIRECTORY] True [TUMOR/NORMAL_MATCHED_CSV] 2 5 0.01 0.0001 hg19 [NEW_DATABASES_DIRECTORY] [RENAMED_BAMS] [ANNOVAR.pl] True [PANEL.vcf]

Otherwise, replace the path to the HaplotypeCaller calls with a path to a blacklist panel filter. Any mutation in the intersection that is also found in this panel will be removed. If you also want to filter against a second panel, give its path in the last field ([PANEL2.vcf]); otherwise, just input the first panel path.

sh postProcessing/Filter.sh [FILTERING_OUTPUT_DIRECTORY] [PANEL1.vcf] False [TUMOR/NORMAL_MATCHED_CSV] 2 5 0.01 0.0001 hg19 [NEW_DATABASES_DIRECTORY] [RENAMED_BAMS] [ANNOVAR.pl] True [PANEL2.vcf]

If you would rather not run the later steps of panel filtering (which also include strict 1000G masks and removal of mutations at or near SV/InDel regions), set the panel-filtering field to False and leave off the panel path. For example, with a matched normal:

sh postProcessing/Filter.sh [FILTERING_OUTPUT_DIRECTORY] [HAPLOTYPECALLER_CALLS_DIRECTORY] True [TUMOR/NORMAL_MATCHED_CSV] 2 5 0.01 0.0001 hg19 [NEW_DATABASES_DIRECTORY] [RENAMED_BAMS] [ANNOVAR.pl] False 
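Because Filter.sh takes its fields positionally, it can help to assemble the command from named variables so each value lands in the right slot. The following is a minimal sketch, not part of SNVCurate: build_filter_cmd is a hypothetical name, the cutoffs and hg19 are simply the values used in the examples above, and the function only prints the command.

```shell
# Hypothetical helper: print a Filter.sh command built from named values.
# Field order follows the usage line for Filter.sh.
build_filter_cmd() {
    intersection_dir=$1   # [FILTERING_OUTPUT_DIRECTORY]
    normal=$2             # HaplotypeCaller calls directory, or a panel VCF
    matched=$3            # True if $normal is a matched normal, else False
    csv=$4                # tumor/normal matched CSV
    db_dir=$5             # Annovar databases directory
    bam_dir=$6            # directory of renamed BAMs
    annovar=$7            # Annovar Perl script
    panel=${8:-}          # optional panel VCF; empty turns panel filtering off

    # Cutoffs and reference copied from the tutorial examples.
    alt_cut=2; total_cut=5; vaf_cut=0.01; maf_cut=0.0001; reference=hg19

    if [ -n "$panel" ]; then use_panel=True; else use_panel=False; fi

    set -- sh postProcessing/Filter.sh "$intersection_dir" "$normal" "$matched" \
        "$csv" "$alt_cut" "$total_cut" "$vaf_cut" "$maf_cut" "$reference" \
        "$db_dir" "$bam_dir" "$annovar" "$use_panel"
    if [ -n "$panel" ]; then
        set -- "$@" "$panel"
    fi
    echo "$*"
}
```

Passing True with a HaplotypeCaller directory reproduces the matched-normal form, and passing False with a panel VCF as the second argument reproduces the unmatched form.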

4. Run postProcessing/Annotate.sh. If you have a matched normal:

sh postProcessing/Annotate.sh [ANNOVAR.pl] [NEW_DATABASES_DIRECTORY] [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY] hg19 [TUMOR/NORMAL_MATCHED_CSV] [HAPLOTYPECALLER_CALLS_DIRECTORY]

Otherwise:

sh postProcessing/Annotate.sh [ANNOVAR.pl] [NEW_DATABASES_DIRECTORY] [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY] hg19 [TUMOR/NORMAL_MATCHED_CSV] 

Script Information

  1. Intersect.sh: Bash script to organize and intersect the calls by MuTecT and MuSE.
usage: sh Intersect.sh [OUTPUT_DIRECTORY] [MUTECT2_PATH] [MUSE_PATH]
  • Both the MuTecT2 and MuSE paths should point to the files output directly by MuTecT2 and MuSE. The script will create, organize, and manipulate files on its own.
  • The MuSE path is optional, but recommended.
  2. Filter.sh: Bash script to filter the intersection of the calls.
usage: sh Filter.sh [PATH_TO_INTERSECTION] [NORMAL] [MATCHED_NORMAL] [CSV] [ALT_CUT] [TOTAL_CUT] [VAF_CUT] [MAF_CUT]
                    [REFERENCE] [ANNOVAR_DATABASES] [BAM_PATH] [ANNOVAR_SCRIPT] [FILTER_WITH_PANEL] [PANEL]
  • All fields are required unless indicated. All paths should be full paths.
  • [PATH_TO_INTERSECTION]: The full path to the directory of the intersection of the calls.
  • [NORMAL]: The full path to the directory of the normal calls from HaplotypeCaller or a Panel of Normals (i.e., a sample of germline calls to filter out).
  • [MATCHED_NORMAL]: A Boolean value indicating whether or not the normal is a matched normal (i.e., from GenotypeGVCFs).
  • [CSV]: Path to the original csv file containing matched tumor/normal pairs.
  • [ALT_CUT]/[VAF_CUT]: The alternate read depth and VAF cutoffs. Calls that fail these cutoffs are written to a file with bad_somatic_quality in the filename.
  • [TOTAL_CUT]: The total read depth (i.e., alt + ref) cutoff.
  • [MAF_CUT]: The population allele frequency cutoff used to filter germline variants.
  • [REFERENCE]: hg19 or hg38 (for Annovar).
  • [ANNOVAR_DATABASES]: The path to the Annovar databases created from SetupDatabases.sh.
  • [BAM_PATH]: The full path to the directory of BAM files.
  • [ANNOVAR_SCRIPT]: The path to the Annovar Perl script. On Orchestra, this is /home/mk446/bin/annovar/table_annovar.pl.
  • [FILTER_WITH_PANEL]: True (if PoN filtering is desired), False (otherwise). Currently, panel filtering is only supported for hg19/b37.
  • [PANEL] (optional): The path to a Panel of Normals to filter with, if desired. For hg19/b37, the TCGA panel located at /n/data1/hms/dbmi/park/victor/references/ is recommended.
  3. Annotate.sh: Bash script to annotate the filtering results and merge them into final annotated callsets.
usage: sh Annotate.sh [ANNOVAR_SCRIPT] [ANNOVAR_DATABASES] [OUTPUT_DIRECTORY] [PATH_TO_MUTECT2] [REFERENCE] [CSV] [PATH_TO_NORMAL]
  • All paths should be full paths.
  • [ANNOVAR_SCRIPT]: The path to the Annovar Perl script. On Orchestra, this is /home/mk446/bin/annovar/table_annovar.pl.
  • [ANNOVAR_DATABASES]: The path to the Annovar databases created from SetupDatabases.sh.
  • [OUTPUT_DIRECTORY]: The same output directory used before.
  • [PATH_TO_MUTECT2]: Path to the MuTecT2 output.
  • [REFERENCE]: hg19 or hg38 (for Annovar).
  • [CSV]: Path to the original csv file containing matched tumor/normal pairs.
  • [PATH_TO_NORMAL] (optional): The full path to the directory of the normal calls from HaplotypeCaller (if used).
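
Two constraints stated above are easy to check before submitting anything: paths should be full paths, and panel filtering is only supported for hg19/b37. The sketch below shows one possible pre-flight check; is_full_path and check_reference are hypothetical helpers, not part of SNVCurate.

```shell
# Hypothetical check: succeed only if the argument is an absolute path.
is_full_path() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Hypothetical check: validate [REFERENCE] together with [FILTER_WITH_PANEL].
check_reference() {
    reference=$1
    use_panel=$2
    case "$reference" in
        hg19)
            return 0 ;;
        hg38)
            # Panel filtering is documented as hg19/b37 only.
            if [ "$use_panel" = "True" ]; then
                echo "panel filtering is only supported for hg19/b37" >&2
                return 1
            fi
            return 0 ;;
        *)
            echo "REFERENCE must be hg19 or hg38" >&2
            return 1 ;;
    esac
}
```

For instance, `is_full_path "$BAM_PATH" && check_reference hg38 False` succeeds for an absolute BAM directory, while `check_reference hg38 True` fails with a message on standard error.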