
Filtering Without Calling via SNVCurate

Victor Mao edited this page May 8, 2020 · 6 revisions

This is a tutorial for running the filtering steps of this pipeline without running the calling steps (Mutect2, MuSE, HaplotypeCaller). This requires additional steps because the calling portion of the pipeline formats, organizes, and cleans the output directories that the filtering steps rely on.

You should still follow the preprocessing steps to set up the Annovar databases regardless.

We provide example Bash and SLURM commands here.

1. Load an interactive session with >2G memory:

srun --pty -t 0-2:0:0 --mem 5G -p interactive /bin/bash

2. Use postProcessing/cleanFiles.py to generate the proper file structure and tumor-normal matched csv. Use these newly created directories in the following steps. Note that not all flags are mandatory; just use the ones that you have files for.

python3 postProcessing/cleanFiles.py -mutect_path [NEW_MUTECT_CALLS_DIRECTORY] -muse_path [NEW_MUSE_CALLS_DIRECTORY] -haplotypecaller_path [NEW_HAPLOTYPECALLER_CALLS_DIRECTORY] -bam_path [NEW_BAM/BAI_DIRECTORY] -out [TUMOR/NORMAL_MATCHED_CSV_OUTPUT_DIRECTORY] -csv info.csv
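
Because every flag is optional, the command can be assembled from whichever directories you actually have. The following is a minimal sketch, assuming only the flag names shown above; the helper name build_clean_cmd and its argument order are hypothetical and not part of the pipeline. It prints the command rather than running it, so you can inspect it first.

```shell
# Hypothetical helper (not part of the pipeline): print a cleanFiles.py
# command that includes only the flags whose values are non-empty.
# Pass "" for any directory you do not have.
build_clean_cmd() {
    mutect_dir=${1:-}; muse_dir=${2:-}; hc_dir=${3:-}
    bam_dir=${4:-}; out_dir=${5:-}; csv=${6:-}
    # Rebuild the positional parameters as the command line to print.
    set -- python3 postProcessing/cleanFiles.py
    if [ -n "$mutect_dir" ]; then set -- "$@" -mutect_path "$mutect_dir"; fi
    if [ -n "$muse_dir" ]; then set -- "$@" -muse_path "$muse_dir"; fi
    if [ -n "$hc_dir" ]; then set -- "$@" -haplotypecaller_path "$hc_dir"; fi
    if [ -n "$bam_dir" ]; then set -- "$@" -bam_path "$bam_dir"; fi
    if [ -n "$out_dir" ]; then set -- "$@" -out "$out_dir"; fi
    if [ -n "$csv" ]; then set -- "$@" -csv "$csv"; fi
    printf '%s\n' "$*"
}

# Example: no MuSE calls available, so -muse_path is left out.
build_clean_cmd /data/mutect "" /data/hc /data/bams /data/out info.csv
```

To run the command instead of printing it, replace the final printf line with "$@".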

3. Run postProcessing/Intersect.sh.

sh postProcessing/Intersect.sh [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY] [MUSE_CALLS_DIRECTORY]

If you do not have a MuSE file, then just leave the last field blank:

sh postProcessing/Intersect.sh [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY]
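
If you script this step across projects, the optional MuSE argument can be handled in one place. This is a rough sketch under the assumption that Intersect.sh takes exactly the arguments shown above; build_intersect_cmd is a hypothetical name, and the function only prints the command it would run.

```shell
# Hypothetical helper (not part of the pipeline): print an Intersect.sh
# command, appending the MuSE directory only when it is given and exists.
build_intersect_cmd() {
    out_dir=$1; mutect_dir=$2; muse_dir=${3:-}
    set -- sh postProcessing/Intersect.sh "$out_dir" "$mutect_dir"
    if [ -n "$muse_dir" ] && [ -d "$muse_dir" ]; then
        set -- "$@" "$muse_dir"
    fi
    printf '%s\n' "$*"
}

# Example: called without a MuSE directory, so the last field is left blank.
build_intersect_cmd /data/filtering /data/mutect
```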

4. Run postProcessing/Filter.sh. See the "Information about relevant scripts" section of the ReadMe for details on each field. Panel filtering is currently only possible with hg19. If you have a matched normal:

sh postProcessing/Filter.sh [FILTERING_OUTPUT_DIRECTORY] [HAPLOTYPECALLER_CALLS_DIRECTORY] True [TUMOR/NORMAL_MATCHED_CSV] 2 5 0.01 0.0001 hg19 [NEW_DATABASES_DIRECTORY] [RENAMED_BAMS] [ANNOVAR.pl] True [PANEL.vcf]

Otherwise, replace the HaplotypeCaller path with the path to a blacklist panel filter. Any mutation in the intersection that is also found in this panel will be removed. If you wish to filter with a second panel as well, supply it as the last argument ([PANEL2.vcf] below); otherwise, input only the first panel path.

sh postProcessing/Filter.sh [FILTERING_OUTPUT_DIRECTORY] [PANEL1.vcf] True [TUMOR/NORMAL_MATCHED_CSV] 2 5 0.01 0.0001 hg19 [NEW_DATABASES_DIRECTORY] [RENAMED_BAMS] [ANNOVAR.pl] True [PANEL2.vcf]

If you would rather not run the later panel-filtering steps (which also include strict 1000G masks and removal of mutations near or at SV/InDel regions), set the [FILTER_WITH_PANEL] field to False and omit the panel path. For example, with a matched normal:

sh postProcessing/Filter.sh [FILTERING_OUTPUT_DIRECTORY] [HAPLOTYPECALLER_CALLS_DIRECTORY] True [TUMOR/NORMAL_MATCHED_CSV] 2 5 0.01 0.0001 hg19 [NEW_DATABASES_DIRECTORY] [RENAMED_BAMS] [ANNOVAR.pl] False 
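
Filter.sh takes up to fourteen positional arguments, so it is easy to put one in the wrong slot. The sketch below labels each value before assembling the command in the documented order. It assumes the example cutoffs used above (2, 5, 0.01, 0.0001) and hg19 as defaults; build_filter_cmd and the environment variable names ALT_CUT, TOTAL_CUT, VAF_CUT, MAF_CUT, and REFERENCE are hypothetical conveniences, not something the pipeline reads. It also refuses to request panel filtering for a reference other than hg19, since that combination is not supported.

```shell
# Hypothetical helper (not part of the pipeline): print a Filter.sh command
# with every positional argument in the order given by its usage line.
build_filter_cmd() {
    intersection=$1; normal=$2; matched_normal=$3; csv=$4
    databases=$5; bam_path=$6; annovar=$7
    filter_with_panel=${8:-False}; panel=${9:-}
    # Cutoffs and reference default to the example values in this tutorial.
    alt_cut=${ALT_CUT:-2}; total_cut=${TOTAL_CUT:-5}
    vaf_cut=${VAF_CUT:-0.01}; maf_cut=${MAF_CUT:-0.0001}
    reference=${REFERENCE:-hg19}

    if [ "$filter_with_panel" = True ] && [ "$reference" != hg19 ]; then
        echo "panel filtering is only supported for hg19" >&2
        return 1
    fi
    set -- sh postProcessing/Filter.sh "$intersection" "$normal" \
        "$matched_normal" "$csv" "$alt_cut" "$total_cut" "$vaf_cut" \
        "$maf_cut" "$reference" "$databases" "$bam_path" "$annovar" \
        "$filter_with_panel"
    # The panel path is optional and always goes last.
    if [ -n "$panel" ]; then set -- "$@" "$panel"; fi
    printf '%s\n' "$*"
}

# Example: matched normal, no panel filtering.
build_filter_cmd /data/filtering /data/hc True /data/pairs.csv \
    /data/databases /data/bams /opt/annovar/table_annovar.pl False
```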

5. Run postProcessing/Annotate.sh. If you have a matched normal:

sh postProcessing/Annotate.sh [ANNOVAR.pl] [NEW_DATABASES_DIRECTORY] [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY] hg19 [TUMOR/NORMAL_MATCHED_CSV] [HAPLOTYPECALLER_CALLS_DIRECTORY]

Otherwise:

sh postProcessing/Annotate.sh [ANNOVAR.pl] [NEW_DATABASES_DIRECTORY] [FILTERING_OUTPUT_DIRECTORY] [MUTECT_CALLS_DIRECTORY] hg19 [TUMOR/NORMAL_MATCHED_CSV] 

Script Information:

  1. cleanFiles.py: A file used to create the proper directory/file structure for the filtering portion of the pipeline. Instead of moving files, this will read the input csv file and create symbolic links.
usage: python3 cleanFiles.py [-mutect_path MUTECT_OUTPUT_PATH] [-muse_path MUSE_OUTPUT_PATH] 
                             [-haplotypecaller_path HAPLOTYPECALLER_OUTPUT_PATH] [-bam_path BAM_PATH] [-csv CSV_OF_FILES] 
                             [-out OUTPUT_DIRECTORY]
  • -bam_path: The path to the BAM files and their respective index files.
  • -csv: The csv detailing the organization of samples. See /postProcessing/cleanUp.csv for proper formatting.
  • -out: The output directory for the tumor-normal matched csv to be written to.
  2. Intersect.sh: Bash script to organize and intersect the calls by MuTecT and MuSE.
usage: sh Intersect.sh [OUTPUT_DIRECTORY] [MUTECT2_PATH] [MUSE_PATH]
  • Both the MuTecT2 and MuSE paths should point to the directories of files output directly by MuTecT2 and MuSE. The script will create, organize, and manipulate files on its own.
  • The MuSE path is optional, but recommended.
  3. Filter.sh: Bash script to filter the intersection of the calls.
usage: sh Filter.sh [PATH_TO_INTERSECTION] [NORMAL] [MATCHED_NORMAL] [CSV] [ALT_CUT] [TOTAL_CUT] [VAF_CUT] [MAF_CUT]
                    [REFERENCE] [ANNOVAR_DATABASES] [BAM_PATH] [ANNOVAR_SCRIPT] [FILTER_WITH_PANEL] [PANEL]
  • All fields are required unless indicated. All paths should be full paths.
  • [PATH_TO_INTERSECTION]: The full path to the directory of the intersection of the calls.
  • [NORMAL]: The full path to the directory of the normal calls from HaplotypeCaller or a Panel of Normals (i.e. a sample of germline calls to filter out).
  • [MATCHED_NORMAL]: A Boolean value indicating whether or not the normal is a matched normal (i.e. from GenotypeGVCFs).
  • [CSV]: Path to the original csv file containing matched tumor/normal pairs.
  • [ALT_CUT]/[VAF_CUT]: The alternate read-level depth/VAF to cut at. Calls failing these cutoffs will be filtered into a file with bad_somatic_quality in the filename.
  • [TOTAL_CUT]: The total read-level depth (ie. alt + ref) to cut at.
  • [MAF_CUT]: The population germline frequency cutoff.
  • [REFERENCE]: hg19 or hg38 (for Annovar).
  • [ANNOVAR_DATABASES]: The path to the Annovar databases created from SetupDatabases.sh.
  • [BAM_PATH]: The full path to the directory of BAM files.
  • [ANNOVAR_SCRIPT]: The path to the Annovar Perl script. On Orchestra, this is /home/mk446/bin/annovar/table_annovar.pl.
  • [FILTER_WITH_PANEL]: True (if PoN filtering is desired), False (otherwise). Currently, panel filtering is only supported for hg19/b37.
  • [PANEL] (optional): The path to a Panel of Normals to filter with, if desired. For hg19/b37, the TCGA panel located at /n/data1/hms/dbmi/park/victor/references/ is recommended.
  4. Annotate.sh: Bash script to annotate the filtering results and merge them into final annotated callsets.
usage: sh Annotate.sh [ANNOVAR_SCRIPT] [ANNOVAR_DATABASES] [OUTPUT_DIRECTORY] [PATH_TO_MUTECT2] [REFERENCE] [CSV] [PATH_TO_NORMAL]
  • All paths should be full paths.
  • [ANNOVAR_SCRIPT]: The path to the Annovar Perl script. On Orchestra, this is /home/mk446/bin/annovar/table_annovar.pl.
  • [ANNOVAR_DATABASES]: The path to the Annovar databases created from SetupDatabases.sh.
  • [OUTPUT_DIRECTORY]: The same output directory used before.
  • [PATH_TO_MUTECT2]: Path to MuTecT2 output.
  • [REFERENCE]: hg19 or hg38 (for Annovar).
  • [CSV]: Path to the original csv file containing matched tumor/normal pairs.
  • [PATH_TO_NORMAL] (optional): The full path to the directory of the normal calls from HaplotypeCaller (if used).
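
Since Filter.sh and Annotate.sh expect full paths, it can help to check the arguments before submitting a job. This is a small sketch of such a check; require_full_paths is a hypothetical helper, and it only tests that each argument starts with a slash, not that the path exists.

```shell
# Hypothetical helper (not part of the pipeline): fail if any argument is
# not a full (absolute) path. Existence of the path is not checked.
require_full_paths() {
    for p in "$@"; do
        case $p in
            /*) ;;  # starts with "/", so it is a full path
            *) echo "not a full path: $p" >&2; return 1 ;;
        esac
    done
}

# Example: guard a run so that it stops early on a relative path.
if require_full_paths /data/filtering /data/mutect; then
    echo "all paths are full paths"
fi
```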