Version 0.8
This is a larger release and the first update since our publication.
CNVkit now runs under Python 3 as well as 2.7. (#3, #101; thanks @mpschr)
File format changes:
- New "depth" column in .cnn, .cnr, .cns
- In .cns, "weight" is the sum, not mean, of bin-level weights within the segment
New script cnn_updater.py can be used to add the "depth" column to existing .cnn, .cnr and .cns files. However, most CNVkit commands should still work with pre-v0.8 files without using this script first. For best results, rebuild the .cnr and .cns for an ongoing study using the existing targetcoverage, antitargetcoverage and reference .cnn files.
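As a rough illustration of the .cns "weight" change, the sketch below contrasts the two conventions for one segment. The bin weights are made-up values, and this is not CNVkit's own code.

```python
# Hypothetical bin-level weights for the bins that fall inside one segment.
bin_weights = [0.5, 0.75, 1.0, 0.25]

# Pre-0.8 convention: the segment weight is the mean of its bin weights.
mean_weight = sum(bin_weights) / len(bin_weights)

# 0.8 convention: the segment weight is the sum of its bin weights,
# so a segment supported by more bins carries more weight.
sum_weight = sum(bin_weights)

print(mean_weight, sum_weight)
```

With a sum, two segments with the same per-bin quality but different bin counts no longer end up with the same weight.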
Algorithmic changes:
reference, gender, call, diagram, export: Gender, or chromosomal sex, is now inferred with a statistical test instead of a fixed threshold, significantly improving the inferences on noisy or aneuploid samples. (#116)
reference, fix, call: Center log2 values by median of chromosome medians, by default. (#114)
reference, metrics, segmetrics: Improve the calculation of biweight location and biweight midvariance (now in descriptives.py).
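The "median of chromosome medians" centering can be sketched roughly as follows. This is a minimal standard-library illustration, not CNVkit's implementation: the function name and the input layout (a list of (chromosome, log2) tuples) are assumptions, and any special handling of sex chromosomes is left out.

```python
from collections import defaultdict
from statistics import median

def center_by_chromosome_medians(bins):
    """Shift log2 values so the median of per-chromosome medians is zero.

    `bins` is a list of (chromosome, log2) tuples (hypothetical layout).
    """
    by_chrom = defaultdict(list)
    for chrom, log2 in bins:
        by_chrom[chrom].append(log2)
    # One median per chromosome, then the median of those medians.
    chrom_medians = [median(values) for values in by_chrom.values()]
    shift = median(chrom_medians)
    return [(chrom, log2 - shift) for chrom, log2 in bins]

# Made-up bins: chr3 is gained, chr1 and chr2 are near neutral.
bins = [("chr1", 0.25), ("chr1", 0.5), ("chr1", 0.75),
        ("chr2", 0.0), ("chr2", 0.25), ("chr2", 0.5),
        ("chr3", 1.0), ("chr3", 1.5)]
centered = center_by_chromosome_medians(bins)
print(centered[:3])
```

Because each chromosome contributes a single median, one gained or lost chromosome shifts the center less than it would with a plain mean over all bins.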
These deprecated components (since 0.7.x) have been removed:
- Commands rescale and loh -- use call and scatter, respectively, instead
- Some options in export bed and export theta -- use call first instead
- Script genome2access.py -- use cnvkit.py access instead
Updated commands:
batch:
- New option --method, with choices "hybrid" (default), "wgs", "amplicon", to simplify/streamline usage with whole-genome or amplicon sequencing protocols. See documentation for details; in short, "wgs" and "amplicon" do not use antitargets or the edge/density bias correction; "wgs" by default uses the sequencing-accessible genome as the targets, and uses a more stringent significance threshold for segmentation.
- Hide/deprecate --split option; it's always on now. To ensure bin coordinates do not change between batch runs (they generally won't anyway), use the -r/--reference option instead of specifying -t and -a in batch.
- Add --drop-low-coverage option, which is passed to segment internally.
- The -p/--processes option is also passed to coverage and segment internally (see below).
antitarget:
- Increase the default average bin size from 100kb to 200kb.
coverage:
- Parallelize coverage calculation over BED rows. The number of threads can be specified with the -p option. (#121; thanks @brentp)
segment:
call:
- New --filter option, with choices 'cn', 'ampdel', 'ci', 'sem' implemented.
- With VCF b-allele frequencies (-v, 'baf'), always calculate the allele-specific integer copy numbers 'cn1' and 'cn2' so that 'cn1' is the larger one. BAF mirror direction stays majority-rules. (#105; thanks @mpschr)
- If b-allele frequencies are used and total copy number is zero, report allelic copy numbers as 0, not NaN.
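One way to picture the 'cn1'/'cn2' behavior described for call is the sketch below. It is an assumption-laden illustration, not the code in CNVkit: the function name is hypothetical, the BAF is mirrored per segment to the majority side, and rounding is half-up.

```python
import math

def allelic_copy_numbers(total_cn, baf):
    """Split an integer total copy number into (cn1, cn2) with cn1 >= cn2.

    Hypothetical helper: `baf` is the segment's b-allele frequency.
    """
    if total_cn == 0:
        # No copies at all: report 0 for both alleles rather than NaN.
        return 0, 0
    # Mirror the BAF so it always describes the more abundant allele.
    mirrored = max(baf, 1.0 - baf)
    # Round half up so the larger share always goes to cn1.
    cn1 = int(math.floor(total_cn * mirrored + 0.5))
    cn2 = total_cn - cn1
    return cn1, cn2

print(allelic_copy_numbers(2, 0.5))
print(allelic_copy_numbers(3, 0.33))
print(allelic_copy_numbers(0, float("nan")))
```

Checking for a zero total first avoids doing arithmetic on an undefined BAF, which is where a NaN would otherwise come from.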
scatter:
- Add --title option.
- Allow selecting & labeling gene(s) w/ only segments as input.
heatmap, scatter:
- Allow saving plots in any image file format supported by matplotlib. The file format is determined by the output filename's extension, e.g. 'png' saves in PNG format -- making it easier to integrate CNVkit plots with HTML reports. (#120; thanks @chapmanb)
diagram:
- Add -g/--gender option to specify sample's known gender.
gainloss:
- Make output tables more consistent across options. Show individual gene names (rather than all genes grouped within a segment in 1 row); don't show rows with no gene name; report the segment probe count instead of number of probes within the gene; show any extra columns present in the input .cns file. (#107, #108; thanks @mpschr)
gender:
- Show column headers and Y-chromosome log2 values in the output table.
segmetrics:
- Add stats options for mean, median, mode
- Add MSE, SEM stats as options
metrics, segmetrics:
- Add --drop-low-coverage option (like in segment and gainloss)
Internals:
- New sub-package tabio: a more robust I/O framework unifying support for tabular formats, including CNVkit's .cnn/.cnr/.cns, BED, SEG, VCF, GATK/Picard interval list, and text coordinates (chr:start:end). Base class GenomicArray and its derived classes CopyNumArray and VariantArray do not implement their own I/O, but rather are instantiated via tabio. The "import-" commands use this as well.
- Removed rary.RegionArray; all functionality is now in tabio and GenomicArray.
- New module "descriptives.py" implements descriptive statistics on plain numpy arrays or pandas Series instances, independent of CNVkit.
- Better testing on Travis, covering Python 2.7, 3.4 and 3.5, on both Linux and OS X (thanks @kyleabeauchamp, @rmcgibbo, and @mpharrigan; #110)
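For readers unfamiliar with the biweight location mentioned above, here is a minimal one-step version of Tukey's estimator using only the standard library. It is a textbook sketch, not a copy of descriptives.py: the function name, the tuning constant c=6 and the single-step form are assumptions, and CNVkit's own code may differ (for example by iterating to convergence).

```python
from statistics import median

def biweight_location(values, c=6.0):
    """One-step Tukey biweight location of a sequence of numbers."""
    values = list(values)
    m = median(values)
    # Median absolute deviation (MAD) sets the scale for down-weighting.
    mad = median([abs(x - m) for x in values])
    if mad == 0:
        # No spread around the median: nothing to re-weight.
        return m
    num = 0.0
    den = 0.0
    for x in values:
        u = (x - m) / (c * mad)
        if abs(u) < 1:
            # Points far from the median get smoothly smaller weights;
            # points beyond c * MAD are ignored entirely.
            w = (1 - u * u) ** 2
            num += (x - m) * w
            den += w
    return m + num / den

data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(biweight_location(data), sum(data) / len(data))
```

The outlier at 100 drags the arithmetic mean to 22, while the biweight location stays close to the bulk of the data.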
Bug fixes:
batch: Errors in parallel processes will immediately be raised as exceptions at the top level, rather than dying silently. Previously, no error would occur until a missing output file was needed later in the pipeline. (#55)
segment:
- Skip possible R warning text when parsing CBS output (#106) and run Rscript with the --vanilla option (#112; thanks @jsmedmar). Non-isolated R processes were prone to add various warning messages to the expected SEG output, which could crash the "segment" command for some users.
- Handle zero-weight bins better (#128; thanks @chapmanb).
scatter:
- VCF parsing (now within tabio) improved.
export theta: don't crash when SNP VCF is a single, unpaired sample, or if segmented input (.cns) is empty.
heatmap: Avoid a possible crash if a sample is missing a chromosome.
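The batch fix for errors in parallel workers follows a general pattern that can be sketched with concurrent.futures (available on Python 2.7 through the "futures" backport). The worker and sample names below are hypothetical, and threads are used here only to keep the example portable; this is not the code in CNVkit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_sample(name):
    """Hypothetical worker standing in for one per-sample pipeline step."""
    if name == "bad":
        raise ValueError("failed on sample %r" % name)
    return name.upper()

def run_all(names, processes=2):
    """Run workers in parallel and surface any worker error immediately."""
    results = []
    with ThreadPoolExecutor(max_workers=processes) as pool:
        futures = [pool.submit(process_sample, n) for n in names]
        for fut in as_completed(futures):
            # result() re-raises the exception from the worker, so a
            # failure stops the run here instead of passing unnoticed.
            results.append(fut.result())
    return sorted(results)

print(run_all(["a", "b"]))
```

Without the call to result(), an exception inside a worker stays stored on its future and the caller never sees it, which matches the silent failures described in the batch entry.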
Packaging:
- Universal wheels are enabled for installation with pip (setup.cfg).
New & updated dependencies:
- futures
- futurize
- numpy raised to version 1.9
- pandas raised to version 0.18.1
- pysam version 0.9.1.1 is specifically excluded