Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for overlapping reads in Theoretical Sensitivity model #1802

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
9 changes: 7 additions & 2 deletions src/main/java/picard/analysis/CollectWgsMetrics.java
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,9 @@ public class CollectWgsMetrics extends CommandLineProgram {
@Argument(doc="Allele fraction for which to calculate theoretical sensitivity.", optional = true)
public List<Double> ALLELE_FRACTION = new ArrayList<>(Arrays.asList(0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.5));

@Argument(doc="Phred scaled PCR Error Rate for Theoretical Sensitivity model.", optional = true)
public int PCR_ERROR_RATE = 45;
fleharty marked this conversation as resolved.
Show resolved Hide resolved

@Argument(doc = "If true, fast algorithm is used.")
public boolean USE_FAST_ALGORITHM = false;

Expand Down Expand Up @@ -242,8 +245,10 @@ protected int doWork() {
if (THEORETICAL_SENSITIVITY_OUTPUT != null) {
// Write out theoretical sensitivity results.
final MetricsFile<TheoreticalSensitivityMetrics, ?> theoreticalSensitivityMetrics = getMetricsFile();
log.info("Calculating theoretical sentitivity at " + ALLELE_FRACTION.size() + " allele fractions.");
List<TheoreticalSensitivityMetrics> tsm = TheoreticalSensitivity.calculateSensitivities(SAMPLE_SIZE, collector.getUnfilteredDepthHistogram(), collector.getUnfilteredBaseQHistogram(), ALLELE_FRACTION);
log.info("Calculating theoretical sensitivity at " + ALLELE_FRACTION.size() + " allele fractions.");

List<TheoreticalSensitivityMetrics> tsm = TheoreticalSensitivity.calculateSensitivities(SAMPLE_SIZE,
collector.getUnfilteredDepthHistogram(), collector.getUnfilteredBaseQHistogram(), ALLELE_FRACTION, collector.basesExcludedByOverlap, PCR_ERROR_RATE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collector.basesExcludedByOverlap needs to be divided by total bases to get fraction overlapping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. I think this is fixed by using collector.getMetrics(dupeFilter, adapterFilter, mapqFilter, pairFilter).PCT_EXC_OVERLAP instead.

theoreticalSensitivityMetrics.addAllMetrics(tsm);
theoreticalSensitivityMetrics.write(THEORETICAL_SENSITIVITY_OUTPUT);
}
Expand Down
53 changes: 43 additions & 10 deletions src/main/java/picard/analysis/TheoreticalSensitivity.java
Original file line number Diff line number Diff line change
Expand Up @@ -243,7 +243,10 @@ public TheoreticalSensitivity() {
* @param randomSeed random number seed to use for random number generator
* @return Theoretical sensitivity for the given arguments at a constant depth.
*/
public static double sensitivityAtConstantDepth(final int depth, final Histogram<Integer> qualityHistogram, final double logOddsThreshold, final int sampleSize, final double alleleFraction, final long randomSeed) {
public static double sensitivityAtConstantDepth(final int depth, final Histogram<Integer> qualityHistogram,
final double logOddsThreshold, final int sampleSize,
final double alleleFraction, final long randomSeed,
final double overlapProbability, final int pcrErrorRate) {
fleharty marked this conversation as resolved.
Show resolved Hide resolved
// If the depth is 0 at a particular locus, the sensitivity is trivially 0.0.
if (depth == 0) {
return 0.0;
Expand All @@ -264,10 +267,21 @@ public static double sensitivityAtConstantDepth(final int depth, final Histogram
for (int sample = 0; sample < sampleSize; sample++) {
final int altDepth = bd.sample();

final int sumOfQualities;
int sumOfQualities;
if (altDepth < LARGE_NUMBER_OF_DRAWS) {
// If the number of alt reads is "small" we draw from the actual base quality distribution.
sumOfQualities = IntStream.range(0, altDepth).map(n -> qualityRW.draw()).sum();
//sumOfQualities = IntStream.range(0, altDepth).map(n -> qualityRW.draw()).sum();
sumOfQualities = 0;
for(int i = 0;i < altDepth;i++) {
fleharty marked this conversation as resolved.
Show resolved Hide resolved
if(rg.nextDouble() > overlapProbability) {
sumOfQualities += Math.min(qualityRW.draw(), pcrErrorRate);
} else {
// Simulate overlapping reads by sampling from the quality distribution twice.
// Since PCR errors affect overlapping reads, the total contribution to the sum
// of qualities cannot exceed the PCR error rate.
sumOfQualities += Math.min(qualityRW.draw() + qualityRW.draw(), pcrErrorRate);
}
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to adjust the large number approx code as well? Particularly since you are introducing a PCR base quality cap.

Copy link
Contributor Author

@fleharty fleharty Jun 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't like the large number approx code. It adds complexity to the code, and really doesn't speed it up enough to justify it.

So I'm removing the approximation code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you profiled this change, in particular when we care about extremely deep data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are a bit long now, I'm trying to resolve the issue.

// If the number of alt reads is "large" we draw from a Gaussian approximation of the base
// quality distribution to speed up the code.
Expand Down Expand Up @@ -305,8 +319,12 @@ static int drawSumOfQScores(final int altDepth, final double averageQuality, fin
* @param alleleFraction the allele fraction to evaluate sensitivity at
* @return Theoretical sensitivity for the given arguments at a constant depth.
*/
private static double sensitivityAtConstantDepth(final int depth, final Histogram<Integer> qualityHistogram, final double logOddsThreshold, final int sampleSize, final double alleleFraction) {
return sensitivityAtConstantDepth(depth, qualityHistogram, logOddsThreshold, sampleSize, alleleFraction, RANDOM_SEED);
private static double sensitivityAtConstantDepth(final int depth, final Histogram<Integer> qualityHistogram,
final double logOddsThreshold, final int sampleSize,
final double alleleFraction, final double overlapProbability,
final int pcrErrorRate) {
return sensitivityAtConstantDepth(depth, qualityHistogram, logOddsThreshold, sampleSize, alleleFraction,
RANDOM_SEED, overlapProbability, pcrErrorRate);
}

/**
Expand All @@ -319,8 +337,17 @@ private static double sensitivityAtConstantDepth(final int depth, final Histogra
* @param alleleFraction the allele fraction to evaluate sensitivity at
* @return Theoretical sensitivity for the given arguments over a particular depth distribution.
*/
public static double theoreticalSensitivity(final Histogram<Integer> depthHistogram, final Histogram<Integer> qualityHistogram,
final int sampleSize, final double logOddsThreshold, final double alleleFraction) {
public static double theoreticalSensitivity(final Histogram<Integer> depthHistogram,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to keep this method overload, given that it is only used in tests? I get that it'd be an API change, but not sure how we feel about that. You're changing the API anyway with respect to sensitivityAtConstantDepth, so probably best to be consistent.

final Histogram<Integer> qualityHistogram, final int sampleSize,
final double logOddsThreshold, final double alleleFraction) {
return theoreticalSensitivity(depthHistogram, qualityHistogram, sampleSize, logOddsThreshold, alleleFraction,
0.0, 45);
}

public static double theoreticalSensitivity(final Histogram<Integer> depthHistogram,
final Histogram<Integer> qualityHistogram, final int sampleSize,
final double logOddsThreshold, final double alleleFraction,
final double overlapProbability, final int pcrErrorRate) {
if (alleleFraction > 1.0 || alleleFraction < 0.0) {
throw new IllegalArgumentException("Allele fractions must be between 0 and 1.");
}
Expand All @@ -346,7 +373,8 @@ public static double theoreticalSensitivity(final Histogram<Integer> depthHistog
}
// Calculate sensitivity for a particular depth, and use trapezoid rule to integrate sensitivity.
final double left = right;
right = sensitivityAtConstantDepth(currentDepth, qualityHistogram, logOddsThreshold, sampleSize, alleleFraction);
right = sensitivityAtConstantDepth(currentDepth, qualityHistogram, logOddsThreshold, sampleSize,
alleleFraction, overlapProbability, pcrErrorRate);
sensitivity += deltaDepthProbability * (left + right) / 2.0;
}
return sensitivity;
Expand Down Expand Up @@ -386,7 +414,11 @@ static double[] trimDistribution(final double[] distribution) {
* @param alleleFractions List of allele fractions to measure theoretical sensitivity over.
*/
public static List<TheoreticalSensitivityMetrics> calculateSensitivities(final int simulationSize,
final Histogram<Integer> depthHistogram, final Histogram<Integer> baseQHistogram, final List<Double> alleleFractions) {
final Histogram<Integer> depthHistogram,
final Histogram<Integer> baseQHistogram,
final List<Double> alleleFractions,
final double overlapProbability,
final int pcrErrorRate) {

final List<TheoreticalSensitivityMetrics> metricsOverVariousAlleleFractions = new ArrayList<>();
final double logOddsThreshold = 6.2; // This threshold is used because it is the value used for MuTect2.
Expand All @@ -396,7 +428,8 @@ public static List<TheoreticalSensitivityMetrics> calculateSensitivities(final i
for (final double alleleFraction : alleleFractions) {
final TheoreticalSensitivityMetrics theoreticalSensitivityMetrics = new TheoreticalSensitivityMetrics();
theoreticalSensitivityMetrics.ALLELE_FRACTION = alleleFraction;
theoreticalSensitivityMetrics.THEORETICAL_SENSITIVITY = TheoreticalSensitivity.theoreticalSensitivity(depthHistogram, baseQHistogram, simulationSize, logOddsThreshold, alleleFraction);
theoreticalSensitivityMetrics.THEORETICAL_SENSITIVITY = TheoreticalSensitivity.theoreticalSensitivity(
depthHistogram, baseQHistogram, simulationSize, logOddsThreshold, alleleFraction, overlapProbability, pcrErrorRate);
theoreticalSensitivityMetrics.THEORETICAL_SENSITIVITY_Q = QualityUtil.getPhredScoreFromErrorProbability((1 - theoreticalSensitivityMetrics.THEORETICAL_SENSITIVITY));
theoreticalSensitivityMetrics.SAMPLE_SIZE = simulationSize;
theoreticalSensitivityMetrics.LOG_ODDS_THRESHOLD = logOddsThreshold;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
import picard.analysis.MetricAccumulationLevel;
import picard.analysis.TheoreticalSensitivity;
import picard.analysis.TheoreticalSensitivityMetrics;
import picard.analysis.WgsMetrics;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

import picard.cmdline.CommandLineProgram;
import picard.cmdline.StandardOptionDefinitions;
import picard.metrics.MultilevelMetrics;
Expand Down Expand Up @@ -105,6 +106,9 @@ protected abstract COLLECTOR makeCollector(final Set<MetricAccumulationLevel> ac
@Argument(doc="Output for Theoretical Sensitivity metrics where the allele fractions are provided by the ALLELE_FRACTION argument.", optional = true)
public File THEORETICAL_SENSITIVITY_OUTPUT;

@Argument(doc="Phred scaled PCR Error Rate for Theoretical Sensitivity model.", optional = true)
public int PCR_ERROR_RATE = 45;

@Argument(doc="Allele fraction for which to calculate theoretical sensitivity.", optional = true)
public List<Double> ALLELE_FRACTION = new ArrayList<>(Arrays.asList(0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.5));
/**
Expand Down Expand Up @@ -169,7 +173,11 @@ protected int doWork() {
// Write out theoretical sensitivity results.
final MetricsFile<TheoreticalSensitivityMetrics, ?> theoreticalSensitivityMetrics = getMetricsFile();
log.info("Calculating theoretical sentitivity at " + ALLELE_FRACTION.size() + " allele fractions.");
List<TheoreticalSensitivityMetrics> tsm = TheoreticalSensitivity.calculateSensitivities(SAMPLE_SIZE, collector.getDepthHistogram(), collector.getBaseQualityHistogram(), ALLELE_FRACTION);

final double overlapProbability = ((PanelMetricsBase) metrics.getMetrics().toArray()[0]).PCT_EXC_OVERLAP;

List<TheoreticalSensitivityMetrics> tsm = TheoreticalSensitivity.calculateSensitivities(SAMPLE_SIZE,
collector.getDepthHistogram(), collector.getBaseQualityHistogram(), ALLELE_FRACTION, overlapProbability, PCR_ERROR_RATE);
theoreticalSensitivityMetrics.addAllMetrics(tsm);
theoreticalSensitivityMetrics.write(THEORETICAL_SENSITIVITY_OUTPUT);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -134,8 +134,6 @@ public abstract class TargetMetricsCollector<METRIC_TYPE extends MultilevelMetri

private static final double LOG_ODDS_THRESHOLD = 3.0;

private final File theoreticalSensitivityOutput = null;

private final int minimumMappingQuality;
private final int minimumBaseQuality;
private final boolean clipOverlappingReads;
Expand Down
3 changes: 2 additions & 1 deletion src/test/java/picard/IntelInflaterDeflaterLoadTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,8 @@ public void testIntelDeflaterIsAvailable() {
}

private void checkIntelSupported(final String componentName) {
if (!SystemUtils.IS_OS_LINUX && !SystemUtils.IS_OS_MAC) {
// Check if on Linux Mac or Apple Silicon (e.g. Apple M1)
if ((!SystemUtils.IS_OS_LINUX && !SystemUtils.IS_OS_MAC) || SystemUtils.OS_ARCH.equals("aarch64")) {
fleharty marked this conversation as resolved.
Show resolved Hide resolved
throw new SkipException(componentName + " is not available on this platform");
}

Expand Down
38 changes: 29 additions & 9 deletions src/test/java/picard/analysis/TheoreticalSensitivityTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -249,12 +249,16 @@ public void testSensitivityAtConstantDepth(final double expected, final File met
metrics.read(metricsFileReader);
}

Object hi[] = metrics.getMetrics().toArray();
fleharty marked this conversation as resolved.
Show resolved Hide resolved

final List<Histogram<Integer>> histograms = metrics.getAllHistograms();
final Histogram<Integer> qualityHistogram = histograms.get(1);

List<?> l = metrics.getMetrics();
fleharty marked this conversation as resolved.
Show resolved Hide resolved
// We ensure that even using different random seeds we converge to roughly the same value.
for (int i = 0; i < 3; i++) {
double result = TheoreticalSensitivity.sensitivityAtConstantDepth(depth, qualityHistogram, 3, sampleSize, alleleFraction, i);
double result = TheoreticalSensitivity.sensitivityAtConstantDepth(depth, qualityHistogram, 3,
sampleSize, alleleFraction, i, 0.0, 45);
Assert.assertEquals(result, expected, tolerance);
}
}
Expand All @@ -267,15 +271,20 @@ public Object[][] arbFracSensDataProvider() {
// are not quite large enough to converge properly, but is used for the purpose of
// keeping the compute time of the tests short.
return new Object[][]{
{0.90, wgsMetricsFile, 0.5, 400},
{0.78, wgsMetricsFile, 0.3, 400},
{0.29, wgsMetricsFile, 0.1, 500},
{0.08, wgsMetricsFile, 0.05, 500},
{0.90, wgsMetricsFile, 0.5, 400, false},
{0.78, wgsMetricsFile, 0.3, 400, false},
{0.29, wgsMetricsFile, 0.1, 500, false},
{0.08, wgsMetricsFile, 0.05, 500, false},

{0.90, wgsMetricsFile, 0.5, 400, true},
{0.80, wgsMetricsFile, 0.3, 400, true},
{0.35, wgsMetricsFile, 0.1, 500, true},
{0.12, wgsMetricsFile, 0.05, 500, true}
};
}

@Test(dataProvider = "TheoreticalSensitivityDataProvider")
public void testSensitivity(final double expected, final File metricsFile, final double alleleFraction, final int sampleSize) throws Exception {
public void testSensitivity(final double expected, final File metricsFile, final double alleleFraction, final int sampleSize, final boolean useOverlapProbability) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some tests that explore a range of different overlap probabilities and pcr error rates

// This tests Theoretical Sensitivity using distributions on both base quality scores
// and the depth histogram.

Expand All @@ -292,7 +301,15 @@ public void testSensitivity(final double expected, final File metricsFile, final
final Histogram<Integer> depthHistogram = histograms.get(0);
final Histogram<Integer> qualityHistogram = histograms.get(1);

final double result = TheoreticalSensitivity.theoreticalSensitivity(depthHistogram, qualityHistogram, sampleSize, 3, alleleFraction);
// Get overlap probability from PCT_EXC_OVERLAP
final double overlapProbability = ((WgsMetrics) metrics.getMetrics().toArray()[0]).PCT_EXC_OVERLAP;
final double result;
if (useOverlapProbability) {
result = TheoreticalSensitivity.theoreticalSensitivity(depthHistogram, qualityHistogram, sampleSize, 3, alleleFraction, overlapProbability, 45);
}
else {
result = TheoreticalSensitivity.theoreticalSensitivity(depthHistogram, qualityHistogram, sampleSize, 3, alleleFraction);
}

Assert.assertEquals(result, expected, tolerance);
}
Expand Down Expand Up @@ -327,6 +344,7 @@ public void testHetVsArbitrary(final File metricsFile, final double tolerance, f
final double[] qualityDistribution = TheoreticalSensitivity.normalizeHistogram(qualityHistogram);
final double[] depthDistribution = TheoreticalSensitivity.normalizeHistogram(depthHistogram);

// TS is TheoreticalSensitivity, THS is TheoreticalHetSensitivity
final double resultFromTS = TheoreticalSensitivity.theoreticalSensitivity(depthHistogram, qualityHistogram, sampleSize, 3, 0.5);
final double resultFromTHS = TheoreticalSensitivity.hetSNPSensitivity(depthDistribution, qualityDistribution, sampleSize, 3);

Expand Down Expand Up @@ -382,7 +400,8 @@ public void testDrawSumOfQScores(final File metricsFile, final int altDepth, fin
final List<Histogram<Integer>> histograms = metrics.getAllHistograms();

final Histogram<Integer> qualityHistogram = histograms.get(1);
final TheoreticalSensitivity.RouletteWheel qualityRW = new TheoreticalSensitivity.RouletteWheel(TheoreticalSensitivity.trimDistribution(TheoreticalSensitivity.normalizeHistogram(qualityHistogram)));
final TheoreticalSensitivity.RouletteWheel qualityRW = new TheoreticalSensitivity.RouletteWheel(
TheoreticalSensitivity.trimDistribution(TheoreticalSensitivity.normalizeHistogram(qualityHistogram)));

final Random randomNumberGenerator = new Random(51);

Expand All @@ -392,7 +411,8 @@ public void testDrawSumOfQScores(final File metricsFile, final int altDepth, fin

for (int k = 0; k < 1; k++) {
int sumOfQualitiesFull = IntStream.range(0, altDepth).map(n -> qualityRW.draw()).sum();
int sumOfQualities = TheoreticalSensitivity.drawSumOfQScores(altDepth, averageQuality, standardDeviationQuality, randomNumberGenerator.nextGaussian());
int sumOfQualities = TheoreticalSensitivity.drawSumOfQScores(altDepth, averageQuality,
standardDeviationQuality, randomNumberGenerator.nextGaussian());

Assert.assertEquals(sumOfQualitiesFull, sumOfQualities, sumOfQualitiesFull * tolerance);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,8 @@ public void testCoverageHistogram() throws IOException {
"MINIMUM_MAPPING_QUALITY=" + minimumMappingQuality,
"MINIMUM_BASE_QUALITY=" + minimumBaseQuality,
"CLIP_OVERLAPPING_READS=" + clipOverlappingReads,
"SAMPLE_SIZE=" + sampleSize
"SAMPLE_SIZE=" + sampleSize,
"THEORETICAL_SENSITIVITY_OUTPUT=" + "ts.out"
};

Assert.assertEquals(runPicardCommandLine(args), 0);
Expand Down
Loading