Visualization and control-independent classification tool of noncancer (somatic or germline) mosaic single nucleotide variants (SNVs) with deep convolutional neural networks. Originally written by Virginia (Xin) Xu and Xiaoxu Yang, maintained by Arzoo Patel.
-Step 2. Prediction for mosaicism (DeepMosaic Classification Module)
- DeepMosaic Visualization Module: Information of aligned sequences for any SNV represented with an RGB image:
An RGB image was used to represent the pileup results for all the reads aligned to a single genomic position. Reads supporting different alleles were grouped, in the order of the reference allele, the first, second, and third alternative alleles, respectively. Red channel was used to represent the bases, green channel for the base qualities, and blue channel for the strand orientations of the read. Note that the green channel is modified to show better contrast for human eyes.
- DeepMosaic Classification Module: Workflow from variant to result (10 models were compared and Efficientnet b4 was selected as default because it performed the best on a gold standard benchmark dataset.):
Workflow of DeepMosaic on best-performed deep convolutional neural network model after benchmarking. Variants were first transformed into images based on the alignment information. A deep convolution neural network then extracted the high-dimensional information from the image, and experimental, genomic, and population-related information was further incorporated into the classifier.
- Python 3.7
- git-lfs for the system you work on
- BEDTools (command line)
- ANNOVAR (command line)
- PyTables
- Matplotlib
- pandas
- Pysam
- PyTorch version>=1.6.0
- EfficientNet PyTorch version>=0.7.1
- argparse
Some of the versions of packages are provided as an example in this list.
Alternatively, you can use the singularity container. See Singularity.
We are now providing a singularity image to run DeepMosaic. If you want to install and run DeepMosaic manually, please read through and follow these steps. The following steps could be performed in a command line shell environment (Linux, Mac, Windows subsystem Linux etc., whichever has the computational resource and >20G storage to run DeepMosaic)
Make sure you have git-lfs installed in your environment (download git-lfs, unzip the tar.gz and put the binary file git-lfs
in your bin folder/your $PATH, and run git lfs install
to initialize git-lfs, you only need to do it once) to be able to download this repository correctly.
> git clone --recursive https://github.com/shishenyxx/DeepMosaic
Make sure you cloned the whole repository, total folder size should be ~ 4G.
> cd DeepMosaic
> conda install -c bioconda bedtools
a) Go to the ANNOVAR website and click "here" to register and download the annovar distribution.
b) Once you have sucessfully download ANNOVAR package, run
> cd [path to ANNOVAR]
> perl ./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad_genome humandb/
to intall the hg19.gnomad_genome file needed for the feature extraction from the bam file
This step is used for the extraction of genomic features of the variant from raw bams as well as population information. It can serve as an independent tool for the visualization and evaluation of mosaic candidates.
> [DeepMosaic Path]/deepmosaic/deepmosaic-draw -i <input.txt> -o <output_dir> -a <path_to_ANNOVAR> -b <genome_build> -db <name_of_annovar_db>
input.txt
file should be in the following format.
#sample_name | bam | vcf | depth | sex |
---|---|---|---|---|
sample_1 | sample_1.bam | sample_1.vcf | 200 | M |
sample_2 | sample_2.bam | sample_2.vcf | 200 | F |
Each line of input.txt
is a sample with its aligned reads in the bam format (with index in the same directory), and its candidate variants in the vcf (or vcf.gz) format. User should also provide the sequencing depth and the sex (M/F) of the corresponding sample. Sample name (#sample_name column) should be a unique identifier for each sample; duplicated names are not allowed.
Note the sequencing depth is required for increasing specificity and if the user is not clear about the average depth, we recommend piloting a fast depth analysis with SAMtools mpileup for several hundreds of variants, or a complete depth of coverage analysis. The depth value should be integers.
-
DeepMosaic supports no-loss image representation for sequencing depth up to 500x. Reads with deeper sequencing depth will be randomly down-sampled to 500x during image representation.
-
sample.bam
is a bam file that is generated through alignment, sort, markduplicate, indel realign, and base quality score recalibration. You can follow the BSMN common pipeline for both GRCh37 and GRCh38, or this pipeline for GRCh37 alignment specifically. Note that this used to be the best pipeline for GATK3 and earlier version. GATK4 onwards, however, integrated indel realign into haplotypecaller and MuTect2. So if you want to use any external tools you have to prepare the bam with earlier GATK and the tutorials should be here. -
sample.vcf
is the vcf file of input variants you are interested in, or prior file generated by GATK haplotypecaller with polidy 50 as described in previosu pipelines, or MuTect2 single mode, each vcf should be provided for each input bam and the format should be in the following format, gziped vcf is also recognizable:
#CHROM | POS | ID | REF | ALT | ... |
---|---|---|---|---|---|
1 | 17697 | . | G | C | . |
1 | 19890 | . | T | C | . |
"#CHROM", "POS", "REF", "ALT" are essential columns that will be parsed and utilized by DeepMosaic.
While using MuTect2 we recommend "PASS" vcfs as input for DeepMosaic. Running MuTect2 single mode, generate the panel of normals and downstream filtering could either be found following the official GATK tutorials, or following this example snakemake pipeline.
-
The outputs files including the extracted features and encoded imaged will be output to
[output_dir]
. DeepMosaic will create a new directory if[output_dir]
hasn't been initialized by users. -
path_to_ANNOVAR
is the absolute path to the ANNOVAR program directory. -
genome_build
is the build version of the reference genome, currentlyhg19
andhg38
are supported. defaults tohg19
. -
name_of_annovar_db
is the name of the db you want to use from the annovar subdirectory[annovar/humandb]
. For example, if you want to useannovar/humandb/hg38_gnomad312_genome.txt
, you would use-db gnomad312_genome
. This option is fed directly into the annovar command as--dbtype
. defaults tognomad_genome
. -
To generate h5 files for other genome builds (not recommended) please follow this link, note that this package runs in Python 2.7.
After deepmosaic-draw is successfully executed, the following files/directories would be generated in the [output_dir]
features.txt
contains the extracted features and the absolute path to the encoded image (.npy) file for each variant in each row.features.txt
will serve as input file to the next step of mosaicism prediction.
#sample_name | sex | chrom | pos | ref | alt | variant | maf | lower_CI | upper_CI | variant_type | gene_id | gnomad | all_repeat | segdup | homopolymer | dinucleotide | depth_fraction | image_filepath | npy_filepath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sample_1 | M | 1 | 17697 | G | C | 1_17697_G_C | 0.18236472945891782 | 0.15095348571574527 | 0.21862912439071866 | ncRNA_exonic | WASH7P | 0.1231 | 1 | 1 | 0 | 0 | 3.09 | /.../images/sample_1-1_17697_G_C.jpg | /.../matrices/sample_1-1_17697_G_C.npy |
-
matrices
is a directory of the encoded image representations in the .npy format for all the candidate variants from all samples. Names of the file would be in the format of[sample_name]-[chrom]_[pos]_[ref]_[alt].npy
. -
images
is a directory of the encoded image representations in the .jpg format for all the candidate variants from all samples. Names of the file would be in the format of[sample_name]-[chrom]_[pos]_[ref]_[alt].jpg
. Image files in this directory could be directly open and inspected visually by users. -
repeats_annotation.bed
is the intermediate file annotating the repeat and segdup information of each variant. -
input.hg19_gnomad_genome_dropped
,input.hg19_gnomad_genome_filtered
,input.exonic_variant_function
,input.variant_function
are ANNOVAR outputs annotating the gnomad and variant function information.
> [DeepMosaic Path]/deepmosaic/deepmosaic-predict -i <output_dir/feature.txt> -o <output.txt> -m [prediction_model (default: efficientnet-b4_epoch_6.pt)] -b [batch_size (default: 10)] -gb <genome_build>
-
output_dir/feature.txt
is the output file from last step. -
output.txt
is the final prediction results. -
prediction_model
is the pretrained DeepMosaic model. The default one (best performing model efficientnet-b4_epoch_6.pt) is trained on our train set for 6 epoch from the efficientnet-b4 architecture. -
batch_size
is the number of images (variants) predicted by DeepMosaic model simultaneously. Larger batch size means more memory and faster prediction. User can adjust this value depending on his/her available computing power. Default batch size is 10. -
genome_build
is the build version of the reference genome, currently hg19 and hg38 are supported.
#sample_name | sex | chrom | pos | ref | alt | variant | maf | lower_CI | upper_CI | variant_type | gene_id | gnomad | all_repeat | segdup | homopolymer | dinucleotide | depth_fraction | score1 | score2 | score3 | prediction | image_filepath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sample_1 | M | 1 | 17697 | G | C | 1_17697_G_C | 0.18236472945891782 | 0.15095348571574527 | 0.21862912439071866 | ncRNA_exonic | WASH7P | 0.1231 | 1 | 1 | 0 | 0 | 3.09 | 0.9999058880667084 | 6.519687262508766e-10 | 9.411128132280348e-05 | artifact | /.../images/sample_1-1_17697_G_C.jpg |
-
The prediction result is in the column "prediction". The possible results are
mosaic
,heterozygous
,ref_homozygous
,alt_homozygous
orartifact
. Only variants marked bymosaic
are DeepMosaic predicted mosaic positive. The prediction decision is made by considering the mosaic score generated by DeepMosaic deeplearning model as well as the extracted, user-input, as well as annotated features such as maf, depth_fraction, repeat, segdup, etc. All genomic coordinates and annotations are based on GRCh37d5 reference genome. -
Image representations of the variants are stored in the files indicated by "image_filepath" column. User can directly open the .jpg files and visually inspect the piled reads for sanity check.
-
Raw extracted, user-input, as well as annotated features are listed in the output file, to allow users to implement further filters:
maf
,lower_CI
, and upper_CI
are calculated from the mutant allelic fractions and 95% exact binomial confidence intervals extracted from the bam file.
variant_type
and gene_id
are annotated by ANNOVAR.
gnomad
is annotated from the combined allele frequency in gnomAD (v2.1.1).
all_repeat
and segdup
are provided in the "resources" folder.
homopolymer
and dinucleotide
are calculated from the .h5 files in the "resources" folder.
We also provided a Snakemake wrapper for DeepMosaic users.
We have provided a simple example in the sub-directory of "demo". The directory includes the input files and the expected results from running DeepMosaic. User could refer to the example for the expected input format and output format.
sample_1.vcf
sample_2.vcf
sample_3.vcf
sample_4.vcf
sample_1.bam sample_1.bam.bai
sample_2.bam sample_2.bam.bai
sample_3.bam sample_3.bam.bai
sample_4.bam sample_4.bam.bai
features.txt (intermediate result of running deepmosaic-draw)
final_predictions.txt (final result of running deepmosaic-predict)
-----images (image encodings in .jpg formats)
-----matrices (image encodings in .npy format to be used in prediction directly)
repeat.annotation.bed (intermediate file for repeat annotation)
input.variant_function, input.exonic_variant_function, input.hg19_gnomad_genome_dropped, input.hg19_gnomad_genome_filtered, input.log (intermediate files after running annovar)
#sample_name | bam | vcf | depth | sex |
---|---|---|---|---|
sample_1 | bams/sample_1.bam | vcfs/sample_1.vcf | 200 | M |
sample_2 | bams/sample_2.bam | vcfs/sample_2.vcf | 200 | M |
sample_3 | bams/sample_3.bam | vcfs/sample_3.vcf | 200 | M |
sample_4 | bams/sample_4.bam | vcfs/sample_4.vcf | 200 | M |
#sample_name | sex | chrom | pos | ref | alt | variant | maf | lower_CI | upper_CI | variant_type | gene_id | gnomad | all_repeat | segdup | homopolymer | dinucluotide | depth_fraction | score1 | score2 | score3 | prediction | image_filepath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sample_1 | M | 10 | 25509499 | A | G | 10_25509499_A_G | 0.05737704918032788 | 0.03448247887605271 | 0.09399263167327017 | intronic | GPR158 | 0.0 | 0 | 0 | 1 | 0 | 1.22 | 0.00010761513038663674 | 3.852715883900453e-05 | 0.9998538577107744 | mosaic | results/images/sample_1-10_25509499_A_G.jpg |
sample_2 | M | 14 | 37531674 | A | T | 14_37531674_A_T | 0.9948186528497408 | 0.9712392635106 | 0.9990847787125622 | intronic | SLC25A21 | 0.2267 | 1 | 0 | 1 | 1 | 0.98 | 0.19976294102631714 | 4.0270887857736005e-06 | 0.800233031884897 | alternative_homozygous | results/images/sample_2-14_37531674_A_T.jpg |
sample_3 | M | 20 | 1805075 | G | T | 20_1805075_G_T | 0.018072289156626502 | 0.008308354195089811 | 0.03886110152464575 | intergenic | LOC100289473(dist=44683),SIRPA(dist=69738) | 0.0 | 0 | 0 | 0 | 0 | 1.66 | 0.003562673370702711 | 2.9057256040721804e-06 | 0.9964344209036933 | mosaic | results/images/sample_3-20_1805075_G_T.jpg |
sample_4 | M | 16 | 65589896 | G | C | 16_65589896_G_C | 0.5306122448979592 | 0.43252467204457545 | 0.6263904306010359 | ncRNA_intronic | LINC00922 | 0.3142 | 1 | 0 | 1 | 0 | 0.49 | 0.9998079754132149 | 5.6467567415316954e-08 | 0.00019196811921752858 | heterozygous | results/images/sample_4-16_65589896_G_C.jpg |
Due to package differences and internal machine differences, the demo result on your machine might be slightly different from the numbers shown here (<0.1% deviations), but the overall prediction should be the same.
If you have you own training set, you can train you own DeepMosaic model using trainModel.py.
-i: input file, tab delimiated |path_to_npy_file_generated_by_DeepMosaic_draw|label|
-e: training epoches
-o: output directory
--model_type: supported model types, see the model folder
--model_path: path to the base model (pt file)
example command:
python trainModel.py -i test_input_training_10.csv -e 2 --model_type efficientnet-b4 --model_path efficientnet-b4_epoch_6.pt -o ./test_trained_model
Singularity containers can be found on Sylabs.
- The singularity container currently only works with hg19/GRCh37 and hg38/GRCh38.
- You'll need your own copy of ANNOVAR outside the singularity (please specify the path of ANNOVAR in
<options>
).
Basic Usage
singularity exec DeepMosaic.sif deepmosaic-draw <options>
singularity exec DeepMosaic.sif deepmosaic-predict <options>
Training and using your own model
singularity exec DeepMosaic.sif python /DeepMosaic/deepmosaic/trainModel.py <options>
singularity exec DeepMosaic.sif deepmosaic-predict <options> --model-path <path_to_your_model>
See Usage and Model Training for more details.
- WGS
We estimated > 90% experimental validation rate for WGS data identified as "mosaic" by DeepMosaic (GRCh37).
- WES
We estimated ~40% experimental validation rate for WES data identified as "mosaic" by the current DeepMosaic WGS model (GRCh37).
Note that the performance of DeepMosaic on GRCh38 will be different, our preliminary estimation showed.
Starting from Jan 2023, new Q&A section will be added to the wiki page, please also visit the issues or closed issues sections to see whether other users already encountered the same questions.
-
Q: How do I run DeepMosaic for multiple samples most efficiently?
A: If you have a large number of variants in each file, to run DeepMosaic in parallel, submit each file in independent input files. If you have a relatively small number of variants from each file but multiple files (samples), integrate everything together into one input file. If you have a huge vcf, you can split it into smaller vcfs and run them parallelly (for both visualization and quantification). You only need to split the vcf, not the bam file.
-
Q: How do I balance/further filter the variants base on DeepMosaic output?
A: For WGS variants, the exclusion of annotated homopolymer and dinucleotide repeats will remove false positives and increase the validation rate, but decrease the sensitivity.
-
Q: What do Score 1, Score 2, and Score 3 mean in the output file?
A: The three scores are combined information from the complex features extracted by the neural network, from our experiences, Score 1 is more like a "het and homo probability", Scores 2&3, especially Score 3 is more like a "potential mosaic possibility". In other words, the higher Score 1 is, the more likely the candidate is a germline variant, whereas the higher Score 3 is, the more likely the candidate is a mosaic variant. But both categories contained a lot of potential artifacts, that's why for the final output we included a more complex classifier.
-
Q: How to deal with mitochondria and sex chromosomes?
A: First you should choose a reference genome that supports mitochondria as a separate chromosome. DeepMosaic is not specifically trained on mitochondria variants so we can't guarantee the result, thus we suggest removing the MT variants from DeepMosaic input. For sex chromosomes, DeepMosaic takes into consideration the biological gender of the input sample and also considered the pseudo autosomal regions separately.
-
Q: Can I use DeepMosaic for cancer somatic mutation detection without control?
A: The current models presented by DeepMosaic does not support cancer samples, according to benchmarks, the specificity is high (0.97) while the sensitivity is low. We are training new models that support single sample accurate detection of somatic mutations in cancer.
-
Q: What genome versions does DeepMosaic support?
A: DeepMosaic is benchmarked on GRCh37(hg19) we are working on some tests for GRCh38(hg39) and are providing some scripts and annotation resources here the model is still the same so the main differences lie in coordinate differences. We will make further updates when we finish new models trained on GRCh38 or CHM13. As most of our current benchmark experiments are carried out on GRCh37 we cannot guarantee the performance on GRCh38.
-
Q: Why I got errors about pickle_module.load(f, **pickle_load_args)?
A: Because you didn't fully download DeepMosaic, the entire model folder should be more than 200 MB. Please refer to the git-lfs section in the tutorial.
Yang X*,#, Xin X*, et al. Gleeson JG#. Control-independent mosaic single nucleotide variant detection with DeepMosaic. (Nature Biotechnology)
The Manuscript is also available here.
Released under GNU-GPL 3.0 licence.
If you have any questions please post a thread at the issues section or contact us at:
📧 Xiaoxu Yang: [email protected], [email protected]
📧 Virginia (Xin) Xu: [email protected]
📧 Joseph Gleeson: [email protected]