This is a Snakemake pipeline for preparing training, evaluation or inference data for deepSNP.
The pipeline generates VCF files with variants for given BAM samples.
Input samples should be prepared according to the GATK Best Practices for data pre-processing for variant discovery.
The pipeline uses the Mutect2 caller in tumor-only mode, so it yields all candidate variants, including true somatic and true germline variants as well as artefacts.
Somatic variants can then be filtered out and germline variants can be labelled in the output VCFs during postprocessing.
See rules for a more detailed description of what's going on.
- Install miniconda.
- Create and activate the conda environment:
    conda env create -f environment.yml
    conda activate vcalling
- Set the required parameters in config.yaml (see rules for additional details).
To run the pipeline on a SLURM cluster, run ./run_slurm.sh. Make sure that the chosen cluster nodes have AVX support (see below).
Modern versions of the GATK variant callers run much faster with AVX support. If the CPU doesn't support AVX instructions, calling may take a very long time. Even if the CPU does support them, check that the Mutect2 log file contains the message "Using CPU-supported AVX-512 instructions"; otherwise calling will be slower.
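Both checks described above can be scripted. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and the CPU check assumes a Linux x86 node where /proc/cpuinfo has a "flags" line.

```shell
#!/usr/bin/env bash

# Print the AVX-family flags advertised by the CPU, one per line.
# Assumption: Linux x86, where /proc/cpuinfo contains a "flags" line.
# An alternative cpuinfo file can be passed as the first argument.
cpu_avx_flags() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    grep -m1 '^flags' "$cpuinfo" | tr ' ' '\n' | grep -E '^avx' | sort -u
}

# Succeed only if the given Mutect2 log file contains the AVX-512 message
# quoted in the paragraph above.
log_reports_avx512() {
    grep -qF 'Using CPU-supported AVX-512 instructions' "$1"
}
```

After sourcing these functions on a cluster node, `cpu_avx_flags` lists what the CPU offers (empty output means no AVX support), and `log_reports_avx512 path/to/mutect2.log && echo "AVX-512 in use"` confirms that the caller picked it up. The log path here is a placeholder; use wherever your run writes the Mutect2 log.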