Version v0.9.7 beta
This release contains several major enhancements particularly relevant to germline analysis. If used in production pipelines, further evaluation and benchmarking would be wise. Highlights:
Control sample clustering: To make better use of larger reference sample pools, reference --cluster will correlate the given normal samples' bin-wise coverage depths to extract clusters to be used as reference profiles. The reference .cnn file produced this way will then contain the log2 and spread summary statistics for each cluster, in addition to the global summary stats. Given this "clustered reference" profile, fix --cluster will then correlate each test sample to each clustered log2 profile in the reference to choose the most relevant control pool for normalization. The batch option --cluster will perform both these steps. Nod to Gambin lab and the authors of ExomeDepth, CoNVaDING, CLAMMS, and others for inspiration. (#308)
Calculation of bin weights has changed. This will change your segmentation results, hopefully for the better. Details below. (#429)
The batch pipeline now performs some segmentation post-processing automatically: calculating and filtering segmentation calls by 50% confidence intervals of the segment mean log2 ratios, in order to reduce false positives, followed by separate bin-level testing to detect small (e.g. exon-size) CNVs that were not caught by segmentation. The bin- and segment-level results are returned as separate .cns files; deciding whether and how to combine or use these results together is left as an exercise for the user.
We've dropped Python 2.7 support. Python version 3.5 or later is now required.
This is a beta release. Please let me know how it works for you via the Issues page. If this release contains any issues that are blocking your work, try installing one of the previous stable versions 0.9.6 or 0.9.5::
conda install cnvkit=0.9.6
Dependencies
- Remove all Python 2.7 compatibility shims.
- Raise minimum pandas version from 0.20.1 to 0.23.3.
- Add scikit-learn (dependency of pomegranate, for HMM segmentation). Remove the older hmmlearn implementation.
Commands
batch:
- Post-process segments with segmetrics (50% CI), call (filter by CI, but don't call integer copy number), and bintest.
- Return bintest result as a separate, independent .cns output.
- Add option '--segment-method', equivalent to segment -m.
- Rename option '--method' to '--seq-method' (but '--method' still accepted for now).
- Add option --cluster, passed to reference and fix if given. (#308)
bintest:
- New command superseding cnv_ztest.py script.
- Report p-value as a column p_bintest (previously ztest) in the .cns output.
- Fix probabilities for positive log2 values, i.e. gains, which previously always had p-value = 1.0. (#429)
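As a rough illustration of why gains and losses should receive symmetric p-values, here is a minimal sketch of a two-sided normal test on a single bin. It is not taken from the bintest source: the function name `two_sided_p` is hypothetical, and the standard deviation is derived from the bin weight under the assumption (suggested by the "1-var" wording in these notes) that weight is roughly 1 minus variance.

```python
import math


def two_sided_p(log2_ratio, weight):
    """Two-sided normal p-value for one bin's log2 ratio (illustrative only).

    Assumption: weight ~ 1 - variance, so SD = sqrt(1 - weight).
    """
    # Guard against a zero SD when the weight is exactly 1
    sd = math.sqrt(max(1.0 - weight, 1e-12))
    z = log2_ratio / sd
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area,
    # so a gain and a loss of equal magnitude get the same p-value
    return math.erfc(abs(z) / math.sqrt(2.0))


gain = two_sided_p(0.8, 0.9)
loss = two_sided_p(-0.8, 0.9)
neutral = two_sided_p(0.0, 0.9)
```

Taking the absolute value of the z-score is what keeps positive log2 values from being pushed toward p = 1.0. Any multiple-testing correction across bins is left out of this sketch.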
fix:
- Change calculation of bin weights to be more consistent with 1-var meaning, with more emphasis on reference spread. It is now simpler, more consistent with import-rna, and particularly improves the accuracy of bintest. (#429)
- Squeeze the range of reference-free weights
- Drop bins with gc outside [.3, .7]. CLAMMS paper shows these bins carry no useful signal.
- With --cluster and a clustered reference input, calculate the test sample's Pearson correlation versus each cluster's log2, and take the best one for normalization.
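The cluster selection step in the last item can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the fix implementation: the helper names `pearson` and `pick_cluster`, the cluster labels, and the toy log2 values are all hypothetical, and real profiles would have one value per bin.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def pick_cluster(sample_log2, cluster_log2):
    """Return the name of the cluster profile best correlated with the sample.

    cluster_log2 maps a (hypothetical) cluster name to its per-bin log2 values.
    """
    return max(cluster_log2, key=lambda name: pearson(sample_log2, cluster_log2[name]))


# Toy data: the sample is cluster_1 shifted by a constant offset
clusters = {
    "cluster_1": [0.1, -0.2, 0.3, 0.0, -0.1],
    "cluster_2": [-0.3, 0.2, -0.1, 0.4, 0.1],
}
sample = [0.2, -0.1, 0.4, 0.1, 0.0]
best = pick_cluster(sample, clusters)
```

Because Pearson correlation ignores constant offsets and scale, a sample that differs from a cluster profile only in overall depth still matches that cluster.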
reference:
- With --cluster, do k-means clustering of the sample bin-level read depth correlation matrix, per Kusmirek et al. 2018. Parameter k defaults to the cube root of number of samples. Only clusters of at least 4 samples are kept for emitting summary statistics in the reference profile.
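To make the clustering step more concrete, the following is a minimal pure-Python sketch of the idea described above: build the sample-by-sample correlation matrix, run k-means on its rows with k set to the cube root of the sample count, and keep only clusters of at least 4 samples. Several details are assumptions for illustration and are not stated in these notes: rounding the cube root to the nearest integer, the plain Lloyd's iteration with a deterministic farthest-point start, and the names `kmeans_labels` and `cluster_samples`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def kmeans_labels(rows, k, n_iter=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [rows[0]]
    while len(centers) < k:
        # Next center: the row farthest from all centers chosen so far
        centers.append(max(rows, key=lambda r: min(math.dist(r, c) for c in centers)))
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(r, centers[j])) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # leave an empty cluster's center where it was
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_samples(depths, min_size=4):
    """Group samples by k-means on their correlation matrix; drop small clusters."""
    n = len(depths)
    corr = [[pearson(a, b) for b in depths] for a in depths]
    k = max(1, round(n ** (1 / 3)))  # assumption: cube root rounded to nearest int
    clusters = {}
    for idx, lab in enumerate(kmeans_labels(corr, k)):
        clusters.setdefault(lab, []).append(idx)
    return [members for members in clusters.values() if len(members) >= min_size]


# Toy depths: 8 samples x 6 bins, two groups with opposite coverage trends
depths = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    [0.9, 2.1, 2.9, 4.1, 5.0, 6.2],
    [1.0, 1.9, 3.1, 4.0, 4.9, 6.1],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    [6.1, 5.0, 3.9, 3.1, 2.0, 0.9],
    [5.9, 5.1, 4.0, 2.9, 2.1, 1.0],
    [6.0, 4.9, 4.1, 3.0, 1.9, 1.1],
]
kept = cluster_samples(depths)
```

With 8 samples the cube root gives k = 2, and each returned cluster is a list of sample indices from which per-cluster log2 and spread statistics could then be computed.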
segment:
- hmm: Fix pomegranate-based implementation. Use iterative Savitzky-Golay smoothing with a narrow bandwidth.
- Use HMM for post-TCN segmentation on VCF allele freqs
- Add parameter for smoothing before CBS (thanks @EwaMarek)
segmetrics:
- Add 'ttest' option for 1-sample t-test p-value.
- Implement & expose --smooth-bootstrap option. For smoothing, KDE bandwidth is based on each bin's weight as a proxy for the SD of its log2 ratio values. To reduce the risk of over-smoothing on larger sample sizes, we use a loose interpretation of Silverman's Rule to reduce the bandwidth as the number of bins k in a segment increases (proportional to k^(-1/4)).
API
- do_heatmap: Add 'ax' parameter (thanks @fbrundu)
- CNA.residuals(): speed; keep index intact in returned pd.Series
- smoothing: Linearly roll-off weights in mirrored wings. Affects CNA.smoothed() / savgol, but not rolling median bias correction.
- Rename CNA.smoothed() to CNA.smooth_log2(), since it returns the smoothed log2 values, not a new/altered CNA.
Bug fixes
- batch: Fix argparse formatting issue (#466)
- import-rna: Fix a regression in reading 2-column per-gene counts (-f counts).
- reference: Fix sex inference/usage when creating haploid-x reference (#459; thanks @duartemolha)
- scatter: Use a safe matplotlib backend on OS X to avoid crash
- VariantArray: Fix/streamline indexing of variants by bin/segment