
Sensitivity issues #50

Open
valeandri opened this issue Nov 13, 2024 · 9 comments

@valeandri

Hi @GermanDemidov,

Thanks for the precious tool and the great documentation on it.

I am considering using the tool for germline (and eventually mosaic) CNV calling in panel and WES scenarios.
To have a fair comparison with the tool we used previously, I am testing the algorithm on the ICR639 dataset. It has 74 CNVs across 72 samples, and I am using those 72 samples as the "reference". With this approach I obtain a sensitivity of 81%, compared to around 98% with the previous tool.

Here are the steps and commands used for the benchmark:

  1. BED file annotation and preparation:
BedAnnotateGC -in $inBed -ref $fasta > annotated.bed
  2. BAM coverage extraction:
BedCoverage -bam $inBam -in annotated.bed -min_mapq 3 -out $outCov
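
The per-sample outputs were then merged into a single table. The merge command is not shown above; here is a minimal R sketch of what I did, assuming every BedCoverage output lists the same regions in the same order (the cov/ folder and the _cov.tsv naming are just my convention):

# Merge per-sample BedCoverage outputs: keep chr/start/end from the first
# file, then append each sample's coverage column.
files <- list.files("cov", pattern = "_cov.tsv$", full.names = TRUE)
merged <- read.table(files[1], header = TRUE, sep = "\t", comment.char = "")
for (f in files[-1]) {
  cov <- read.table(f, header = TRUE, sep = "\t", comment.char = "")
  stopifnot(all(cov[, 1:3] == merged[, 1:3]))  # same regions, same order
  merged <- cbind(merged, cov[, 4, drop = FALSE])
}
write.table(merged, "merged_cov.tsv", sep = "\t", quote = FALSE, row.names = FALSE)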

The merged coverage file then has the following structure:

#chr	start	end	10990	11030	11035	11116	11199	11256	11294	11301	11321	11488	11504	12574	12683	16693	17296	17297	17302	17303	17304	17305	17307	17316	17317	17318	17319	17320	17322	17324	17326	17331	17333	17336	17337	17338	17340	17342	17343	17357	17358	17359	17364	17365	17366	17369	17371	17375	17377	17378	17379	17380	17384	17385	17386	17388	17390	17392	17393	17395	17396	17398	17402	17403	17568	17588	18891	19292	21506	21541	21902	21964	22332	9954
chr1	10325412	10325414	126.00	126.50	136.00	195.50	122.50	144.50	154.50	211.00	215.00	123.50	119.00	86.50	103.50	232.00	161.50	186.50	185.50	191.00	200.00	201.00	217.00	185.00	215.50	200.50	201.50	232.00	274.50	271.50	220.00	183.00	318.00	197.50	165.50	257.50	194.50	121.00	241.50	209.50	178.50	168.50	212.50	209.00	150.00	115.50	199.50	180.00	190.00	170.50	180.00	181.00	127.50	173.50	191.50	143.50	186.00	170.50	165.50	196.50	186.50	154.00	163.00	175.50	159.00	162.00	176.00	147.50	352.00	348.50	179.50	176.00	214.00	159.50

  3. CNV calling:
Rscript /ClinCNV/clinCNV.R \
  --normal $outCovMerged  \
  --out $outputFolder   \
  --bed annotated.bed        \
  --folderWithScript /ClinCNV/

Am I missing any crucial points?

Tool versions:

ngs-bits 2019_09 (installed using bioconda)
clinCNV 1.18.3 (latest)

@GermanDemidov
Collaborator

Hi Valentina,

In general, everything looks good! For the human genome hg38, I think you also need to set the --hg38 flag, though the tool often gives an error when the wrong reference genome is used. What are the variants that were not called, are they very short? Are they mosaic? Has this dataset been described somewhere, so I could take a look at the panel design?

@valeandri
Author

Thanks for the quick answer!

Here is the link to the dataset publication, where you can find more about it: https://pubmed.ncbi.nlm.nih.gov/30175241/. To give you a quick overview, the samples were analyzed with the Illumina TruSight Cancer panel, comprising ~100 cancer-related genes.

I did not use --hg38 (even though the tool did not throw any error).

The missed variants are germline, and most of them are small, spanning only 1 or 2 exons.

If you need any other information, just ask ;)

Valentina

@GermanDemidov
Collaborator

I see =) Without --hg38, variants could be discarded if they fall into a "centromere" of hg38 (which may be a normal region in hg19). Also set --lengthG 0, which means even the smallest variants are called. Run this and let me know the sensitivity; it is not yet the maximum of what can be done =) (I hope you have this test run and sensitivity check as a script, so you can run it several times without much human effort.)

100 genes, in particular, is not a lot of genetic material, and some parts could be filtered out as, e.g., GC-extreme. This additional trick will improve the situation: https://github.com/GermanDemidov/segmentation_before_CNV_calling. But first try to run without it, just with --hg38 --lengthG 0 --scoreG 10 (the last flag reduces the quality threshold of the called variants; --lengthG, as I said, is the minimum length in units, which I guess are exons here).
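
So the full call would be:

Rscript /ClinCNV/clinCNV.R \
  --normal $outCovMerged \
  --out $outputFolder \
  --bed annotated.bed \
  --folderWithScript /ClinCNV/ \
  --hg38 --lengthG 0 --scoreG 10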

@valeandri
Author

Thank you!!!

By running the script with your suggestions, I got a sensitivity of 94.4%! A great improvement ;)

I also checked the FP rate, and it looks fine, with an average of 2.1 calls per sample.

Do you have any other suggestions, besides trying out the segmentation as well (https://github.com/GermanDemidov/segmentation_before_CNV_calling)?

Again thanks,
Valentina

@GermanDemidov
Collaborator

Hi Valentina,

Great news! You can probably also check whether the true positive calls have higher log-likelihood scores than the false positives and tune your threshold accordingly.
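
Something like this, as a sketch (the labelled file and the is_true_positive column are things you would have to produce from your truth set; I assume the score column of the output TSV is called loglikelihood):

# Compare log-likelihood of true vs. false positive calls, then pick a
# cutoff that retains most true positives.
calls <- read.table("calls_labelled.tsv", header = TRUE, sep = "\t")
tapply(calls$loglikelihood, calls$is_true_positive, summary)
cutoff <- quantile(calls$loglikelihood[calls$is_true_positive], 0.05)
filtered <- calls[calls$loglikelihood >= cutoff, ]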

What are the missing CNVs? Are they in the PMS2 gene? Can you open the .seg files from the false negative samples in IGV and check the regions of the missing CNVs? What do you see?

@valeandri
Author

Hi German,

I have the results of the FN analysis, and it turns out that PMS2 is not the issue. The benchmark was performed using the exon as the fundamental unit, so TP + FN equals the total number of exon events tested (a short sketch of this computation follows the column list below).

As you can see from the attached file, most of the uncalled exons are at the beginning or end of a larger variant and are annotated as "TooShortOrNA" or "GCnormFailed" in the _cov.seg file.
I understand that exons <50 bp are excluded, so I will fix this in the BED design. How can I improve the other ones? Is there a way to avoid the exclusion of such exons?

The attached file has the following columns:

  • CHR
  • BENCH EXON START
  • BENCH EXON END
  • GENE
  • BENCH VARIANT (can include multiple exons)
  • SAMPLE
  • VARIANCE (from _cov.seg)
  • VALUE (from _cov.seg)
  • CN (from cnvs.tsv, "." if NOT CALLED)
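
For reference, the sensitivity above was computed from this table roughly as follows (a sketch; the file name is made up, and read.table mangles the column names with spaces in the usual way):

# Exon-level sensitivity: an exon counts as detected if a CN was called
# for it (CN != "."), so TP + FN equals the number of benchmark exons.
bench <- read.table("fn_analysis.tsv", header = TRUE, sep = "\t")
tp <- sum(bench$CN != ".")
sensitivity <- tp / nrow(bench)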

Thank you,
Valentina

@GermanDemidov
Collaborator

Hi Valentina,

Yes, I see. I think you can expand small exons to around 100 base pairs. I doubt the actual enrichment target was 50 bp; I would use the original enrichment BED file, not the list of exons.
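
If you do stay with the exon list, the padding could look like this in R (my own snippet, not part of ClinCNV; re-run BedAnnotateGC afterwards, since the GC content of the regions changes):

# Pad every target shorter than 100 bp symmetrically up to ~100 bp.
bed <- read.table("annotated.bed", sep = "\t")   # "#chr" header is skipped
len <- bed$V3 - bed$V2
pad <- pmax(0, ceiling((100 - len) / 2))
bed$V2 <- pmax(0, bed$V2 - pad)                  # do not run past position 0
bed$V3 <- bed$V3 + pad
write.table(bed, "annotated_padded.bed", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)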

You can change the 50bp threshold here: https://github.com/imgag/ClinCNV/blob/master/clinCNV.R, line 293.

For GC normalization, you may relax the criteria here https://github.com/imgag/ClinCNV/blob/master/generalHelpers.R#L132 or line 134: put 10 instead of 25 and 50. The meaning is: "if you have fewer than 10 exons with this GC value, remove them; otherwise, perform GC normalization".
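
The change is of this shape (a paraphrased toy example; the data frame and variable names in generalHelpers.R differ):

# Regions whose rounded GC value is shared by too few other regions are
# removed before normalization, since the correction is unstable there.
bedFile <- data.frame(chr = "chr1", start = 1:6, end = 2:7,
                      gc = c(0.31, 0.31, 0.55, 0.55, 0.55, 0.90))
gcRounded <- round(bedFile$gc, 2)
gcCounts <- table(gcRounded)
keep <- gcCounts[as.character(gcRounded)] >= 2   # in ClinCNV: 25/50, try 10
bedFile <- bedFile[keep, ]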

Try to run this modified code and let me know if it is better.

@valeandri
Author

Hi German,

Thank you as always for the helpful suggestions!

I applied those changes and now get a sensitivity of 96.40%, missing 13 exons out of a total of 357. Most of these exons are at the border of a larger variant; only 3 variants are completely missed (all of them small). Here are those variants together with the _cov.seg VALUE of their exons:

SDHB Exon 1 deletion 1.56
PMS2 Exon 11 duplication 2.73
BRCA1 Exon 18-19 deletion 0.88-1.61

It is a really good result! Of course, I now call twice as many variants, but I can manage that with filters on the log-likelihood.

Do you think those variants could be recovered?

Thanks!
Valentina

@GermanDemidov
Collaborator

Hi Valentina,

The answer depends. Can you check the plots produced? There should be a clustering picture. Do you see any clustering of the samples?

A value of 1.56 is actually closer to diploid than to a deletion (2 vs. 1), so this one is impossible to rescue, unless you have clustering of the samples.
