Release Version 0.8 · etal/cnvkit

This is a larger release and the first update since our publication.

CNVkit now runs under Python 3 as well as 2.7. (#3, #101; thanks @mpschr)

File format changes:

New "depth" column in .cnn, .cnr, .cns
In .cns, "weight" is the sum, not mean, of bin-level weights within the segment
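
To illustrate the new "weight" semantics in .cns files, here is a minimal sketch in plain Python. It is not CNVkit's own code: the bin tuples and the `segment_weight` helper are hypothetical names made up for this example.

```python
# Hypothetical bin-level records: (chromosome, start, end, weight).
bins = [
    ("chr1", 0, 100, 0.5),
    ("chr1", 100, 200, 0.75),
    ("chr1", 200, 300, 0.25),
    ("chr1", 300, 400, 1.0),
]


def segment_weight(bins, chrom, seg_start, seg_end):
    """Sum the weights of all bins lying entirely within one segment."""
    return sum(
        weight
        for b_chrom, b_start, b_end, weight in bins
        if b_chrom == chrom and b_start >= seg_start and b_end <= seg_end
    )


# Three bins fall in chr1:0-300, so the segment weight is their sum (1.5),
# not their mean (0.5).
total = segment_weight(bins, "chr1", 0, 300)
```

Under this convention a segment's weight grows with the number of bins it covers, so segments supported by more bins carry more weight.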

New script cnn_updater.py can be used to add the "depth" column to existing .cnn, .cnr and .cns files. However, most CNVkit commands should still work with pre-v0.8 files without using this script first. For best results, rebuild the .cnr and .cns for an ongoing study using the existing targetcoverage, antitargetcoverage and reference .cnn files.

Algorithmic changes:

reference, gender, call, diagram, export: Gender, or chromosomal sex, is now inferred with a statistical test instead of a fixed threshold, significantly improving the inferences on noisy or aneuploid samples. (#116)
reference, fix, call: Center log2 values by median of chromosome medians, by default. (#114)
reference, metrics, segmetrics: Improve the calculation of biweight location and biweight midvariance (now in descriptives.py).
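
The "median of chromosome medians" centering can be sketched as follows. This is an illustration of the idea only, assuming bins are given as simple (chromosome, log2) pairs; the function name and data layout are hypothetical, and CNVkit's actual implementation may treat some chromosomes or bins differently.

```python
from collections import defaultdict
from statistics import median


def center_by_chrom_medians(bins):
    """Shift log2 values so the median of per-chromosome medians is zero.

    `bins` is a list of (chromosome, log2) pairs (a hypothetical layout).
    """
    by_chrom = defaultdict(list)
    for chrom, log2 in bins:
        by_chrom[chrom].append(log2)
    # One median per chromosome, then the median of those medians.
    chrom_medians = [median(values) for values in by_chrom.values()]
    shift = median(chrom_medians)
    return [(chrom, log2 - shift) for chrom, log2 in bins]


example = [
    ("chr1", 0.5), ("chr1", 0.5), ("chr1", 0.5),
    ("chr2", 0.25),
    ("chr3", 1.0),
]
centered = center_by_chrom_medians(example)
```

In this scheme each chromosome contributes one value to the final median regardless of how many bins it has, so a single large aberrant chromosome has limited influence on the center.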

These deprecated components (since 0.7.x) have been removed:

Commands rescale and loh -- use call and scatter, respectively, instead
Some options in export bed and export theta -- use call first instead
Script genome2access.py -- use cnvkit.py access instead

Updated commands:

batch:

New option --method, with choices "hybrid" (default), "wgs", "amplicon", to simplify/streamline usage with whole-genome or amplicon sequencing protocols. See documentation for details; in short, "wgs" and "amplicon" do not use antitargets or the edge/density bias correction; "wgs" by default uses the sequencing-accessible genome as the targets, and uses a more stringent significance threshold for segmentation.
Hide/deprecate --split option; it's always on now. To ensure bin coordinates do not change between batch runs (they generally won't anyway), use the -r/--reference option instead of specifying -t and -a in batch.
Add --drop-low-coverage option, which is passed to segment internally.
The -p/--processes option is also passed to coverage and segment internally (see below).

antitarget:

Increase the default average bin size from 100kb to 200kb.

coverage:

Parallelize coverage calculation over BED rows. The number of parallel processes can be specified with the -p/--processes option. (#121; thanks @brentp)

segment:

Parallelize CBS and Haar segmentation methods across chromosomes. (#123, #125; thanks @brentp)
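
As a rough picture of per-chromosome parallelism, the sketch below distributes a placeholder routine over chromosomes with the standard `concurrent.futures` module. Everything here is an assumption for illustration: `segment_chromosome` is a stand-in, not a CNVkit function, and a thread pool is used only to keep the example self-contained. `ProcessPoolExecutor` in the same module offers the same `map` interface and would be the more likely choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_chromosome(item):
    """Placeholder for a per-chromosome segmentation routine (hypothetical)."""
    chrom, log2_values = item
    # Stand-in computation: report the mean log2 as a single "segment".
    return chrom, sum(log2_values) / len(log2_values)


def segment_in_parallel(by_chrom, processes=2):
    """Run segment_chromosome on each chromosome using a worker pool."""
    with ThreadPoolExecutor(max_workers=processes) as pool:
        # map() yields results in input order; an exception raised in a
        # worker is re-raised here when its result is retrieved.
        return list(pool.map(segment_chromosome, by_chrom.items()))


data = {"chr1": [0.0, 0.5, 1.0], "chr2": [-1.0, -1.0]}
results = segment_in_parallel(data)
```

Because chromosomes are segmented independently in this sketch, the results can simply be concatenated in input order afterwards.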

call:

New --filter option, with choices 'cn', 'ampdel', 'ci', 'sem' implemented.
With VCF b-allele frequencies (-v, 'baf'), always calculate the allele-specific integer copy numbers 'cn1' and 'cn2' so that 'cn1' is the larger one. BAF mirror direction stays majority-rules. (#105; thanks @mpschr)
If b-allele frequencies are used and total copy number is zero, report allelic copy numbers as 0, not NaN.
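
The two allele-specific behaviors described above can be sketched as follows. This is a simplified illustration and not CNVkit's actual formula: the function name is hypothetical, and it assumes the major allele's share of the total is max(baf, 1 - baf).

```python
def allelic_copy_numbers(total_cn, baf):
    """Split an integer total copy number into (cn1, cn2) with cn1 >= cn2.

    Assumption for this sketch: the major allele's share of the total
    is max(baf, 1 - baf). `baf` is assumed finite unless total_cn is 0.
    """
    if total_cn == 0:
        # No copies at all: report 0 for both alleles rather than NaN,
        # even when the BAF itself is undefined.
        return 0, 0
    major_fraction = max(baf, 1.0 - baf)
    first = int(round(total_cn * major_fraction))
    second = total_cn - first
    # Order the pair so that cn1 is always the larger value.
    return max(first, second), min(first, second)
```

For example, `allelic_copy_numbers(3, 0.33)` gives `(2, 1)`, and a total copy number of 0 gives `(0, 0)` whatever the BAF value is.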

scatter:

Add --title option.
Allow selecting & labeling gene(s) w/ only segments as input.

heatmap, scatter:

Allow saving plots in any image file format supported by matplotlib, not just PDF. The file format is determined by the output filename's extension, e.g. 'png' saves in PNG format -- making it easier to integrate CNVkit plots with HTML reports. (#120; thanks @chapmanb)

diagram:

Add -g/--gender option to specify sample's known gender.

gainloss:

Make output tables more consistent across options. Show individual gene names (rather than all genes grouped within a segment in 1 row); don't show rows with no gene name; report the segment probe count instead of number of probes within the gene; show any extra columns present in the input .cns file. (#107, #108; thanks @mpschr)

gender:

Show column headers and Y-chromosome log2 values in the output table.

segmetrics:

Add stats options for mean, median, mode
Add MSE, SEM stats as options

metrics, segmetrics:

Add --drop-low-coverage option (like in segment and gainloss)

Internals:

New sub-package tabio: a more robust I/O framework unifying support for tabular formats, including CNVkit's .cnn/.cnr/.cns, BED, SEG, VCF, GATK/Picard interval list, and text coordinates (chr:start-end). Base class GenomicArray and its derived classes CopyNumArray and VariantArray do not implement their own I/O, but rather are instantiated via tabio. The "import-" commands use this as well.
Removed rary.RegionArray; all functionality is now in tabio and GenomicArray.
New module "descriptives.py" implements descriptive statistics on plain numpy arrays or pandas Series instances, independent of CNVkit.
Better testing on Travis, covering Python 2.7, 3.4 and 3.5, on both Linux and OS X (thanks @kyleabeauchamp, @rmcgibbo, and @mpharrigan; #110)

Bug fixes:

batch: Errors in parallel processes will immediately be raised as exceptions at the top level, rather than dying silently. Previously, no error would occur until a missing output file was needed later in the pipeline. (#55)
segment:
- Skip possible R warning text when parsing CBS output (#106) and run Rscript with the --vanilla option (#112; thanks @jsmedmar). Non-isolated R processes were prone to add various warning messages to the expected SEG output, which could crash the "segment" command for some users.
- Handle zero-weight bins better (#128; thanks @chapmanb).
scatter:
- Handle selected segments with an empty gene name (#104; thanks @mpschr).
- Don't crash on zero-length GenomicArray/CopyNumArray inputs.
VCF parsing (now within tabio) improved:
- More robust to missing genotype (GT) & depth (DP) fields (#102)
- Handle VCFs from MuTect2 (#122)
export theta: don't crash when SNP VCF is a single, unpaired sample, or if segmented input (.cns) is empty.
heatmap: Avoid a possible crash if a sample is missing a chromosome.
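
The segment fix for stray R warning text (#106) can be pictured with a small parser that ignores everything before the table header. This is a hypothetical sketch, not CNVkit's parser: the function name, the assumed header prefix, and the sample warning text are all made up for illustration, and it only handles extra text that precedes the header.

```python
def parse_seg_output(text, header_start="ID\tchrom"):
    """Parse tab-separated segment rows, ignoring any text before the header.

    `header_start` is an assumption about how the header line begins.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(header_start):
            break
    else:
        raise ValueError("No SEG header line found")
    columns = lines[i].split("\t")
    rows = []
    for line in lines[i + 1:]:
        if line.strip():
            # Values are kept as strings; type conversion is left out here.
            rows.append(dict(zip(columns, line.split("\t"))))
    return rows


# Made-up output: two lines of warning text followed by a one-row table.
raw = (
    "Warning message:\n"
    "package 'DNAcopy' was built under a different R version\n"
    "ID\tchrom\tloc.start\tloc.end\tnum.mark\tseg.mean\n"
    "Sample1\tchr1\t1\t1000\t25\t-0.5\n"
)
segments = parse_seg_output(raw)
```

Running the R process in isolation, as the --vanilla change does, reduces how often such extra text appears in the first place; tolerant parsing covers the remaining cases.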

Packaging:

Universal wheels are enabled for installation with pip (setup.cfg).

New & updated dependencies:

futures
futurize
numpy raised to version 1.9
pandas raised to version 0.18.1
pysam version 0.9.1.1 is specifically excluded
