Version v0.9.7 beta

@etal etal released this 30 Nov 23:04

This release contains several major enhancements particularly relevant to germline analysis. If used in production pipelines, further evaluation and benchmarking would be wise. Highlights:

Control sample clustering: To make better use of larger reference sample pools, reference --cluster will correlate the given normal samples' bin-wise coverage depths to extract clusters to be used as reference profiles. The reference .cnn file produced this way will then contain the log2 and spread summary statistics for each cluster, in addition to the global summary stats. Given this "clustered reference" profile, fix --cluster will then correlate each test sample to each clustered log2 profile in the reference to choose the most relevant control pool for normalization. The batch option --cluster will perform both these steps. Nod to Gambin lab and the authors of ExomeDepth, CoNVaDING, CLAMMS, and others for inspiration. (#308)

Calculation of bin weights has changed. This will change your segmentation results, hopefully for the better. Details below. (#429)

The batch pipeline now performs some segmentation post-processing automatically: calculating and filtering segmentation calls by 50% confidence intervals of the segment mean log2 ratios, in order to reduce false positives, followed by separate bin-level testing to detect small (e.g. exon-size) CNVs that were not caught by segmentation. The bin- and segment-level results are returned as separate .cns files; deciding whether and how to combine or use these results together is left as an exercise for the user.

We've dropped Python 2.7 support. Python version 3.5 or later is now required.

This is a beta release. Please let me know how it works for you via the Issues page. If this release contains any issues that are blocking your work, try installing one of the previous stable versions, 0.9.6 or 0.9.5:

conda install cnvkit=0.9.6

Dependencies

  • Remove all Python 2.7 compatibility shims.
  • Raise minimum pandas version from 0.20.1 to 0.23.3.
  • Add scikit-learn (dependency of pomegranate, for HMM segmentation). Remove the older hmmlearn implementation.

Commands

batch:

  • Post-process segments with segmetrics (50% CI), call (filter by CI, but don't call integer copy number), and bintest.
  • Return bintest result as a separate, independent .cns output.
  • Add option '--segment-method', equivalent to segment -m.
  • Rename option '--method' to '--seq-method' (but '--method' still accepted for now).
  • Add option --cluster, passed to reference and fix if given. (#308)
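
The confidence-interval filtering step in the post-processing above can be pictured with a small Python sketch. It is an illustration only, not CNVkit's code: the dictionary keys (`ci_lo`, `ci_hi`) and the rule "keep a segment only if its interval excludes zero" are assumptions made here, and the real `call` filter may treat neutral segments differently (for example by merging rather than dropping them).

```python
def filter_by_ci(segments):
    """Keep segments whose confidence interval excludes zero.

    Each segment is a dict with hypothetical keys 'log2', 'ci_lo', 'ci_hi'.
    """
    kept = []
    for seg in segments:
        # An interval entirely above or entirely below 0 suggests a real change.
        if seg["ci_lo"] > 0 or seg["ci_hi"] < 0:
            kept.append(seg)
    return kept


segments = [
    {"gene": "A", "log2": 0.45, "ci_lo": 0.38, "ci_hi": 0.52},     # gain, CI above 0
    {"gene": "B", "log2": 0.05, "ci_lo": -0.04, "ci_hi": 0.11},    # CI spans 0
    {"gene": "C", "log2": -0.60, "ci_lo": -0.71, "ci_hi": -0.50},  # loss, CI below 0
]
confident = filter_by_ci(segments)
print([seg["gene"] for seg in confident])
```

A narrow interval such as 50% filters less aggressively than a 95% interval would, since fewer intervals end up spanning zero.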

bintest:

  • New command superseding cnv_ztest.py script.
  • Report p-value as a column p_bintest (previously ztest) in the .cns output.
  • Fix probabilities for positive log2 values, i.e. gains, which previously always had p-value = 1.0. (#429)
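
To see why a bin-level test can report p = 1.0 for gains, compare a lower-tail-only normal test with a two-sided one. This is a generic sketch, not the `bintest` implementation: these notes do not say what caused the original bug, and the function names and the way the z-score is formed (log2 divided by an assumed standard deviation) are illustrative assumptions.

```python
import math


def lower_tail_p(z):
    """P(Z <= z) for a standard normal: small for losses, near 1 for gains."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def two_sided_p(z):
    """P(|Z| >= |z|): treats gains and losses symmetrically."""
    return math.erfc(abs(z) / math.sqrt(2))


def bin_z(log2, sd):
    """Hypothetical z-score of one bin: log2 ratio over its assumed SD."""
    return log2 / sd


z_gain = bin_z(0.8, 0.2)
z_loss = bin_z(-0.8, 0.2)

print("gain, lower tail:", lower_tail_p(z_gain))  # close to 1: the gain is missed
print("gain, two-sided: ", two_sided_p(z_gain))   # small: the gain is detected
print("loss, two-sided: ", two_sided_p(z_loss))   # same magnitude as the gain
```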

fix:

  • Change the calculation of bin weights to be more consistent with their "1 - variance" meaning, with more emphasis on reference spread. The calculation is now simpler, more consistent with import-rna, and particularly improves the accuracy of bintest. (#429)
  • Squeeze the range of reference-free weights.
  • Drop bins with GC content outside [0.3, 0.7]. The CLAMMS paper shows these bins carry no useful signal.
  • With --cluster and a clustered reference input, calculate the test sample's Pearson correlation versus each cluster's log2, and take the best one for normalization.
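
The cluster selection described in the last item can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not CNVkit's code: the cluster names, the dict layout, and the final subtraction step standing in for normalization are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def pick_best_cluster(sample_log2, cluster_log2):
    """Return the name of the best-correlated cluster and all scores."""
    scores = {name: pearson(sample_log2, profile)
              for name, profile in cluster_log2.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy bin-level log2 values; real profiles have thousands of bins.
sample = [0.1, 0.6, -0.4, 0.3]
clusters = {
    "cluster_1": [0.0, 0.5, -0.5, 0.2],
    "cluster_2": [0.5, -0.5, 0.2, 0.0],
}
best, scores = pick_best_cluster(sample, clusters)

# Assumed normalization step: subtract the chosen cluster's log2 profile.
normalized = [s - r for s, r in zip(sample, clusters[best])]
print(best, normalized)
```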

reference:

  • With --cluster, do k-means clustering of the sample bin-level read depth correlation matrix, per Kusmirek et al. 2018. Parameter k defaults to the cube root of the number of samples. Only clusters of at least 4 samples are kept for emitting summary statistics in the reference profile.
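
The clustering recipe above can be approximated in plain Python as follows. Treat it as a sketch under assumptions rather than the actual implementation: rounding the cube root to the nearest integer, the deterministic choice of starting centroids, and all function names are choices made here for illustration.

```python
import math


def default_k(n_samples):
    """Cube root of the sample count; rounding is an assumption of this sketch."""
    return max(1, round(n_samples ** (1 / 3)))


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    return num / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def correlation_matrix(samples):
    """Sample-by-sample Pearson correlations of bin-level depths."""
    return [[pearson(a, b) for b in samples] for a in samples]


def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm with evenly spaced rows as starting centroids."""
    step = len(rows) / k
    centroids = [list(rows[int(i * step)]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def keep_large_clusters(labels, min_size=4):
    """Map cluster label -> sample indices, dropping clusters below min_size."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    return {lab: idxs for lab, idxs in clusters.items() if len(idxs) >= min_size}


def perturb(profile, j):
    """Make a slightly different copy of a toy depth profile."""
    return [v + 0.01 * j * (1 if i % 2 == 0 else -1) for i, v in enumerate(profile)]


base_a = [0.0, 0.5, -0.5, 0.2, -0.3, 0.1]
base_b = [0.4, -0.4, 0.3, -0.2, 0.5, -0.1]
samples = [perturb(base_a, j) for j in range(4)] + [perturb(base_b, j) for j in range(4)]

k = default_k(len(samples))
labels = kmeans(correlation_matrix(samples), k)
print(k, labels, keep_large_clusters(labels))
```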

segment:

  • hmm: Fix pomegranate-based implementation. Use iterative Savitzky-Golay smoothing with a narrow bandwidth.
  • Use HMM for post-TCN segmentation on VCF allele frequencies.
  • Add a parameter for smoothing before CBS (thanks @EwaMarek).

segmetrics:

  • Add 'ttest' option for 1-sample t-test p-value.
  • Implement and expose the --smooth-bootstrap option. For smoothing, the KDE bandwidth is based on each bin's weight as a proxy for the SD of its log2 ratio values. To reduce the risk of over-smoothing on larger sample sizes, we use a loose interpretation of Silverman's Rule to reduce the bandwidth as the number of bins in a segment increases (proportional to k^(-1/4), where k is the number of bins in the segment).
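
A smoothed bootstrap along these lines might look like the sketch below. Several things here are assumptions rather than facts from these notes: the per-bin SD is taken as sqrt(1 - weight), the segment mean is unweighted, Gaussian noise stands in for the KDE kernel, and the function names and defaults are hypothetical.

```python
import math
import random


def bandwidth_scale(k):
    """Shrink factor k^(-1/4) for a segment of k bins."""
    return k ** -0.25


def smoothed_bootstrap_ci(log2, weights, alpha=0.5, n_boot=200, seed=0):
    """Confidence interval of the segment mean from a smoothed bootstrap."""
    rng = random.Random(seed)
    k = len(log2)
    scale = bandwidth_scale(k)
    # Assumption: weight ~ 1 - variance, so each bin's SD ~ sqrt(1 - weight).
    bandwidths = [math.sqrt(max(0.0, 1.0 - w)) * scale for w in weights]
    means = []
    for _ in range(n_boot):
        # Resample bins with replacement, then jitter each draw by its bandwidth.
        picks = [rng.randrange(k) for _ in range(k)]
        draws = [rng.gauss(log2[i], bandwidths[i]) for i in picks]
        means.append(sum(draws) / k)
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


values = [0.4, 0.5, 0.6, 0.5, 0.45, 0.55, 0.5, 0.5]
weights = [0.99] * len(values)
lo, hi = smoothed_bootstrap_ci(values, weights)  # alpha=0.5 gives a 50% CI
print(lo, hi)
```

The added jitter matters most for short segments, where plain resampling of a handful of bins yields only a few distinct means.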

API

  • do_heatmap: Add 'ax' parameter (thanks @fbrundu).
  • CNA.residuals(): Improve speed; keep the index intact in the returned pd.Series.
  • smoothing: Linearly roll off weights in mirrored wings. Affects CNA.smoothed() / savgol, but not rolling median bias correction.
  • Rename CNA.smoothed() to CNA.smooth_log2(), since it returns the smoothed log2 values, not a new/altered CNA.

Bug fixes

  • batch: Fix argparse formatting issue (#466)
  • import-rna: Fix a regression in reading 2-column per-gene counts (-f counts).
  • reference: Fix sex inference/usage when creating haploid-x reference (#459; thanks @duartemolha)
  • scatter: Use a safe matplotlib backend on OS X to avoid a crash.
  • VariantArray: Fix/streamline indexing of variants by bin/segment.