Collection of scripts used for the paper
Sebastian Niehus, Hákon Jónsson, Janina Schönberger, Eythór Björnsson, Doruk Beyter, Hannes P. Eggertsson, Patrick Sulem, Kári Stefánsson, Bjarni V. Halldórsson, Birte Kehr.
PopDel identifies medium-size deletions jointly in tens of thousands of genomes.
Available at Nature Communications: https://www.nature.com/articles/s41467-020-20850-5
All generated output VCF/BCF files are published at Zenodo; DOI: https://doi.org/10.5281/zenodo.3992607
Content
This repository contains the scripts used for the comparisons and evaluations of the paper PopDel calls deletions jointly in tens of thousands of genomes. It does not contain the public data of the Polaris HiSeqX Diversity Cohort, the Polaris Kids Cohort or the Illumina Platinum Genome NA12878. These can be obtained from the respective online sources. To run the scripts, you need to adapt the paths in some of them.
- Random Deletion Data Simulation
- G1k Deletion Data Simulation
- Call Set comparison of Different Variant Callers on HG002 Trio
- Call Set comparison of Different Variant Callers on NA12878
- Polaris HiSeqX Diversity Cohort
- Polaris Kids Cohort
Simulation/uniform_simulation/deletion_data_Simulation/uniform_simulation/
- simulate_deletion_haplotype.py: Simulation of haplotypes.
- simulate_deletions.py: Simulation of reads of the samples.
- simulate.sh: Wraps the above scripts and aligns the simulated reads to GRCh38.
- simulation2vcf.py: Writing the simulated deletions to VCF-files.
Note on random seeds: We used 0 as the random seed to simulate 2000 deletions with the script 'simulate_deletions.txt'. The random seed used for the simulation of reads using 'simulate.sh' was 1 for the first sample, 2 for the second sample and so on.
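The seeding scheme in the note above can be sketched as follows. This is a minimal illustration and not the repository's code: the function names, the genome length and the size range are hypothetical placeholders. Only the seed values (0 for the deletions, the sample index for the reads) and the count of 2000 deletions come from the note.

```python
import random


def simulate_deletion_positions(n_deletions, genome_length, min_size, max_size, seed=0):
    """Draw reproducible (start, size) pairs.

    Hypothetical stand-in for the deletion simulation; the real script may
    sample positions and sizes differently.
    """
    rng = random.Random(seed)  # dedicated generator, global random state stays untouched
    deletions = []
    for _ in range(n_deletions):
        size = rng.randint(min_size, max_size)
        start = rng.randrange(0, genome_length - size)
        deletions.append((start, size))
    return sorted(deletions)


def read_simulation_seed(sample_index):
    """Seed for the read simulation: 1 for the first sample, 2 for the second, and so on."""
    if sample_index < 1:
        raise ValueError("sample indices start at 1")
    return sample_index


# Assumed toy parameters; only seed=0 and the count of 2000 are taken from the note.
deletions = simulate_deletion_positions(
    2000, genome_length=1_000_000, min_size=100, max_size=10_000, seed=0
)
seeds = [read_simulation_seed(i) for i in range(1, 4)]
```

Using a dedicated `random.Random` instance per task keeps the deletion set identical across runs, regardless of how many samples are simulated afterwards.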
Simulation/uniform_simulation/delly/
- generateDelly.sh: Generates the scripts for each batch size.
- runDelly.sh: Runs the scripts generated by generateDelly.sh and measures the resource consumption.
Simulation/uniform_simulation/gridss/
- generateGridss.sh: Generates the scripts for each batch size.
- runGridss.sh: Calls all the scripts generated by generateGridss.sh and measures the resource consumption.
- gridss.filter.sh: Deduplicates the break ends called by GRIDSS and annotates the VCFs.
- annotate.R: Annotation script called by gridss.filter.sh.
Simulation/uniform_simulation/smoove/
- run_smoove.sh: Runs the scripts smoove_single.sh, smoove_1.sh, smoove_2.sh, smoove_3.sh, smoove_4.sh and measures the resource consumption.
- smoove_filter.sh: Filters Smoove's call set.
- smoove.env: Contains the conda environment for Smoove.
- exclude.bed: Contains the regions Smoove should ignore.
Simulation/uniform_simulation/podel/
- popdelProfile.sh: Contains the commands for creating a profile of each bam file.
- runPopDel.sh: Runs all the scripts and measures the resource consumption.
Simulation/uniform_simulation/truth/
Contains an archive (truth.tar.gz) of all simulated variants for each batch size used for the evaluation. The files are the results of the deletion data simulation with the above-mentioned random seeds.
Simulation/uniform_simulation/plots/
- eval_bed.sh: Evaluates the TP/FP/FN of all tools. Note that the range of the evaluation loop might have to be adjusted to match the batch sizes that the respective tools actually processed successfully.
- compare_results.py: Script for an alternative evaluation based on fixed positional and size-estimate margins. Can also consider the genotypes of individual samples.
- eval.sh: Wrapper for compare_results.py.
- simulation_plots.R: Script for generating all plots of the simulated data.
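An evaluation with fixed positional and size margins, as described for compare_results.py, could look roughly like the sketch below. It is not the repository's implementation: the margin values, the tuple layout `(chrom, start, end)` and the greedy one-to-one matching are assumptions made for illustration.

```python
def matches(call, truth, max_pos_diff=300, max_rel_size_diff=0.5):
    """Return True if a called deletion matches a simulated one.

    Both arguments are (chrom, start, end) tuples. The two margins are
    hypothetical defaults, not the values used in the paper.
    """
    if call[0] != truth[0]:
        return False
    if abs(call[1] - truth[1]) > max_pos_diff:
        return False
    call_size = call[2] - call[1]
    truth_size = truth[2] - truth[1]
    return abs(call_size - truth_size) <= max_rel_size_diff * truth_size


def evaluate(calls, truths):
    """Greedy one-to-one matching; returns (TP, FP, FN)."""
    unmatched = list(truths)
    tp = 0
    for call in calls:
        for i, truth in enumerate(unmatched):
            if matches(call, truth):
                tp += 1
                del unmatched[i]  # each simulated deletion can be matched only once
                break
    fp = len(calls) - tp
    fn = len(unmatched)
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision and recall from the counts; 0.0 if a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truths = [("chr1", 1000, 2000), ("chr1", 5000, 5600)]
calls = [("chr1", 1010, 1990), ("chr2", 1000, 2000)]
tp, fp, fn = evaluate(calls, truths)
precision, recall = precision_recall(tp, fp, fn)
```

In this toy input, the first call matches the first simulated deletion, the second call lies on a different chromosome, and the second simulated deletion is missed.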
Simulation/uniform_simulation/deletion_data_Simulation/g1k_simulation/
- Simulation/uniform_simulation/deletion_data_Simulation/g1k_simulation/additionalFiles: Contains the G1k deletion reference set and exclusion files for smoove.
- Simulation/uniform_simulation/deletion_data_Simulation/g1k_simulation/environments: Contains the conda environments for Delly, Manta and Smoove.
- Snakefile.wholeGenome.deletion.simulation: Snakefile for use with Snakemake. Manages the complete simulation and calling workflow for all tools.
- config.yml: Configuration for the corresponding Snakefile.
- eval_full.sh: Compares the VCFs of predicted deletions with the simulated deletions.
- simulation_g1k_plots.R: Takes the tables created by eval_full.sh and plots the results and creates additional tables for comparisons.
- simulate_deletion_haplotype.py: Simulates a variant haplotype sequence from a given deletion. Used by Snakefile.wholeGenome.deletion.simulation.
- simulation2vcf.py: Writes the simulated variants in VCF format. Used by Snakefile.wholeGenome.deletion.simulation.
- compare_results.py: Compares the predicted to the simulated variants for one sample. Used by eval_full.sh.
- delly.environment.yml: Contains the conda environment for Delly.
- Snakefile_call.delly.giab: Snakefile to be used with Snakemake to manage Delly's workflow.
- config.yml: Configuration for the corresponding Snakefile.
- filter_delly.sh: Filters Delly's call sets.
- human.hg19.excl.tsv: Contains the regions Delly should ignore.
- manta.env: Contains the conda environment for Manta.
- Snakefile_manta.giab: Snakefile to be used with Snakemake to manage Manta's workflow.
- config.yml: Configuration for the corresponding Snakefile.
- filter_manta.sh: Filters Manta's call sets.
- Snakefile_GIAB_popdel: Snakefile to be used with Snakemake to manage PopDel's workflow.
- config.yml: Configuration for the corresponding Snakefile.
- filter_manta.sh: Filters PopDel's call sets.
- sampling.regions: Contains the regions PopDel uses for sampling the background distribution. If the option "-r grch37" is used, this file is not required and the sampling is performed on the same regions.
- maxCov.tsv: File containing the desired maximum coverage for PopDel to consider for each of the three genomes. The selected values correspond to 3x each sample's mean coverage.
- all.GRCh37.profiles: File containing the paths to the profiles of the three genomes.
- mendelianError: Contains scripts for calculation and plotting of the Mendelian inheritance error.
- precision_recall: Contains the scripts for calculation and plotting of the precision-recall curves.
- venn_diagram: Contains scripts for calculation and plotting of the venn diagrams.
- delly.environment.yml: Contains the conda environment for Delly.
- Snakefile_call.delly.platinum: Snakefile to be used with Snakemake to manage Delly's workflow.
- human.hg38.excl.tsv: Contains the regions Delly should ignore.
- manta.environment.yml: Contains the conda environment for Manta.
- Snakefile_manta.platinum: Snakefile to be used with Snakemake to manage Manta's workflow.
- config.yaml: Configuration for the corresponding Snakefile.
- GRCh38.regions.bed.gz: Contains the regions Manta should ignore.
- GRCh38.regions.bed.gz.tbi: Index of GRCh38.regions.bed.gz.
- Snakefile_call.smoove.platinum: Snakefile to be used with Snakemake to manage Smoove's workflow.
- config.yml: Configuration for the corresponding Snakefile.
- smoove.env: File containing the conda environment for Smoove.
- exclude.cnvnator_100bp.GRCh38.20170403.bed: Contains the regions Smoove should ignore.
- profile.sh: Contains the command used for generating the profile of NA12878.
- platinum.profiles: Contains the location of the profile generated by above command. Used as input for PopDel call.
- Snakefile_call.popdel.platinum: Snakefile to be used with Snakemake to manage the PopDel call commands for each chromosome.
- contromeres.bed
- pacbio/remapped_NA12878_pacbio_deduplicated_deletions.sort.all.bed: Set of high quality reference deletions for NA12878 based on PacBio long reads. Based on variants from GiaB.
- personalis/remapped_NA12878_personalis_deduplicated_deletions.sort.all.bed: Set of high quality reference deletions for NA12878 based on Illumina short reads. Based on variants from GiaB.
- VennDiagNA12878.R: Creates Venn diagrams for the call sets of PopDel, Delly and Lumpy for NA12878. Works on the output of bedtools/Snakefile.intersect.
- Snakefile.intersect: Snakefile for use with Snakemake. Manages the BED-conversion and overlap calculations via bedtools intersect.
- config.yaml: Configuration of the evaluation.
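The kind of overlap calculation that bedtools intersect performs on BED intervals can be illustrated in a few lines. The sketch below checks a reciprocal overlap between two intervals, comparable to running `bedtools intersect` with `-f` and `-r`. Whether Snakefile.intersect uses these options, and with which threshold, is not stated here; the 50% threshold and the function names are assumptions.

```python
def overlap_length(a, b):
    """Number of overlapping bases of two BED intervals (chrom, start, end).

    BED coordinates are 0-based and half-open, so the length is end - start.
    """
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def reciprocal_overlap(a, b, min_fraction=0.5):
    """True if the overlap covers at least min_fraction of BOTH intervals.

    min_fraction=0.5 is an assumed threshold for illustration only.
    """
    len_a = a[2] - a[1]
    len_b = b[2] - b[1]
    if len_a <= 0 or len_b <= 0:
        return False
    ov = overlap_length(a, b)
    return ov / len_a >= min_fraction and ov / len_b >= min_fraction


call = ("chr1", 100, 200)
reference_hit = ("chr1", 150, 250)   # 50 shared bases, 50% of both intervals
reference_miss = ("chr1", 150, 400)  # 50 shared bases, but only 20% of the reference

hit = reciprocal_overlap(call, reference_hit)
miss = reciprocal_overlap(call, reference_miss)
```

Requiring the overlap in both directions prevents a very long interval from counting as a match for every short deletion it happens to contain.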
polaris_diversity_cohort/delly/
- config.yaml: Configuration for the corresponding Snakefile.
- environment.yaml: Contains the conda environment for Delly.
- Snakefile_polaris_delly: Snakefile for use with Snakemake. Manages Delly's workflow.
- human.hg38.excl.tsv: Contains the regions Delly should ignore.
polaris_diversity_cohort/smoove/
- config.yaml: Configuration for the corresponding Snakefile.
- smoove.env: Contains the conda environment for Smoove.
- Snakefile_smoove_polaris150: Snakefile for use with Snakemake. Manages Smoove's workflow.
- exclude.cnvnator_100bp.GRCh38.20170403.bed: Contains the regions Smoove should ignore.
polaris_diversity_cohort/popdelProfile/
- config.yaml: Configuration for the corresponding Snakefile.
- Snakefile_polaris_profile: Snakefile for use with Snakemake. Manages the creation of the profiles for all samples.
polaris_diversity_cohort/popdelCall/
- Snakefile_rnd_150: Snakefile for use with Snakemake. Applies PopDel call on each chromosome of all samples jointly.
polaris_diversity_cohort/plots/
- extract_calls.sh: Commands for transforming the calls of the tools to the allele-count matrix required by pca_boxplot_varCount.R for the PCA.
- extract_popdel_per_sample_variants_GT26.sh: Counts PopDel's deletions per sample for the given genotype quality (26).
- ancestry.csv: Lists the ancestry of each sample.
- pca_boxplot_varCount.R: Script for creating the PCA plots for all tools and the box plot and variant counts for PopDel.
- environment.yaml: Conda environment for Delly.
- config.yaml: Contains the configuration for the corresponding Snakefile.
- Snakefile_polaris_kids_delly: Manages Delly's workflow.
- smoove.env: Conda environment for Smoove.
- config.yaml: Contains the configuration for the corresponding Snakefile.
- Snakefile_smoove_polarisKids: Manages Smoove's workflow.
- exclude.cnvnator_100bp.GRCh38.20170403.bed: Contains the regions Smoove should ignore.
- family_kids.profiles: Shuffled list of paths to profiles created by PopDel profile.
- Snakefile_polaris_kids_popdelSnakefile_polaris_kids: Snakefile for use with Snakemake. Manages PopDel's workflow.
- hwe.py: Filters the VCF files according to the Hardy-Weinberg-Equilibrium.
- transmission_ntrio.py: Calculates the transmission rates and Mendelian inheritance error rates on the HWE-filtered VCFs.
- kids.ped: Contains the pedigree information of the kids cohort. One trio <Parent1, Parent2, Child> per line.
- tr-mendel.R: Script for creating the transmission rate plots and plots of Mendelian inheritance error.
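For readers unfamiliar with the two trio-based measures, the following sketch shows one way to compute them from genotypes ordered as in kids.ped (Parent1, Parent2, Child). It is not taken from transmission_ntrio.py: genotypes are encoded as deletion allele counts (0, 1 or 2), and the transmission rate is assumed to be measured on trios with exactly one heterozygous parent and one homozygous-reference parent.

```python
def transmittable(genotype):
    """Allele counts (0 or 1) a parent with the given genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(parent1, parent2, child):
    """True if the child's genotype can arise from the parents' genotypes."""
    return any(
        a + b == child
        for a in transmittable(parent1)
        for b in transmittable(parent2)
    )


def mendelian_error_rate(trios):
    """Fraction of trios whose genotypes violate Mendelian inheritance."""
    if not trios:
        return 0.0
    errors = sum(not is_mendelian_consistent(*trio) for trio in trios)
    return errors / len(trios)


def transmission_rate(trios):
    """Fraction of informative trios in which the deletion was transmitted.

    Informative (an assumption of this sketch): one parent heterozygous,
    the other homozygous reference, child genotype Mendelian consistent.
    Returns None if there is no informative trio.
    """
    informative = [
        t for t in trios
        if sorted(t[:2]) == [0, 1] and is_mendelian_consistent(*t)
    ]
    if not informative:
        return None
    transmitted = sum(1 for t in informative if t[2] == 1)
    return transmitted / len(informative)


# Toy genotypes for one deletion in five trios (Parent1, Parent2, Child).
trios = [(0, 1, 1), (0, 1, 0), (1, 1, 2), (0, 0, 1), (2, 0, 1)]
error_rate = mendelian_error_rate(trios)
rate = transmission_rate(trios)
```

For unbiased genotyping, the transmission rate is expected to be close to 0.5; in the toy input, the only inconsistent trio is the child carrier with two non-carrier parents.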