Releases: etal/cnvkit

Version 0.7.10

06 Apr 21:08

diagram:

  • Label genes even when given only segments (.cns). Plotting segments alone, without bin-level copy ratios (.cnr), can be convenient to produce an uncluttered PDF with a smaller file size while retaining most of the important CNV information. (#94)

scatter:

  • For calculating and plotting SNV b-allele frequencies, select the sample of interest from the given VCF based on the .cnr/.cns base filename, unless specified with --sample-id.

export nexus-ogt:

  • Use normal-sample BAFs if normal-sample .cnr given. Previously, it would load tumor BAFs (taking the first tumor sample from the PEDIGREE tag) even if the properly-named .cnr file was for the normal sample in the VCF.
  • Add --sample-id option to select VCF sample. Useful in case .cnr filename base doesn't match the sample IDs in the VCF header.
  • Add filtering options --min-weight, --min-variant-depth.
    • The --min-variant-depth option works the same as in scatter -v, filtering SNVs by coverage depth (INFO field DP, usually) for the b-allele frequency calculation.
    • The --min-weight option allows the user to discard low-weight bins, since Nexus Copy Number doesn't use CNVkit's weights for its own segmentation and could be misled by the noisier log2 ratios in less-reliable bins. For choosing the cutoff value, 0.5 is suitable in our experience, but check the distribution of weights in your own data first.
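
One way to check the weight distribution before choosing a --min-weight cutoff is sketched below. It assumes the .cnr file is a tab-separated table with a header row and a "weight" column, and that bins strictly below the cutoff are discarded; the function names and the sample rows are hypothetical and not part of CNVkit.

```python
import csv
import io
import statistics


def read_weights(handle):
    """Read the 'weight' column from a tab-separated table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [float(row["weight"]) for row in reader]


def summarize_weights(weights, cutoff=0.5):
    """Return the quartiles and the fraction of bins below the cutoff."""
    q1, median, q3 = statistics.quantiles(weights, n=4)
    dropped = sum(1 for w in weights if w < cutoff) / len(weights)
    return {"q1": q1, "median": median, "q3": q3, "fraction_dropped": dropped}


# Made-up rows standing in for a real .cnr file (assumed column layout).
example = io.StringIO(
    "chromosome\tstart\tend\tgene\tlog2\tweight\n"
    "chr1\t100\t200\tA\t0.10\t0.9\n"
    "chr1\t200\t300\tA\t-0.80\t0.2\n"
    "chr1\t300\t400\tB\t0.05\t0.7\n"
    "chr1\t400\t500\tB\t0.60\t0.4\n"
)
weights = read_weights(example)
summary = summarize_weights(weights, cutoff=0.5)
print(summary)
```

If the reported fraction is large, a lower cutoff may be more appropriate for that dataset.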

export vcf:

  • Add custom VCF "FORMAT" fields: FOLD_CHANGE, FOLD_CHANGE_LOG2, PROBES. (#91; thanks @pcingola)

segment:

  • The "flasso" method now works again; it was broken for a few releases. (#88; thanks @pcingola)

Packaging & internal:

  • Add GRCh37 "access" BED file for users' convenience. The access command will also now raise an error if the chromosome names don't match between the "access" and "target" BED files.
  • Work with the latest version of pysam (0.9). (#86)
  • Silence some superfluous warnings from the latest version of pandas (0.18).
  • Documentation updates, including more details on the call command.

Version 0.7.9

14 Mar 18:15

Bug fixes, most importantly to work around an API change in pysam.

Installation:

  • Require pysam version earlier than 0.9 (#86)

fix, reference:

  • If the majority of target bins have no or very low coverage, warn the user
    about this, skip bias corrections, and mask out the low-coverage target bins
    during centering to ensure the output is still vaguely usable and sane.
    This issue could occur because the wrong target BED was used initially, or
    maybe hybridization failed in library prep.

reference:

  • Ensure the output table's columns are ordered correctly. In some cases it was
    possible for the output table's columns to be ordered differently, which still
    works in CNVkit, but is weird.

call, rescale, export:

  • Check specified gender more sensibly; on failure, default to female.
    Specifically, use case-insensitive string comparison to test whether the given
    argument means "male". Treating chrX as having neutral ploidy is probably a
    less surprising fallback, especially if the "-y" flag is forgotten elsewhere
    in the pipeline.
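
The fallback described above can be sketched as follows. The function name and the accepted spellings ("m", "male") are assumptions for illustration; the release note only says the comparison is case-insensitive and that anything else defaults to female.

```python
def is_male_reference(gender_arg):
    """Return True only if the argument clearly means 'male'.

    Anything else, including None or an unrecognized string, falls back to
    female, so chrX is treated as having neutral ploidy.
    """
    if not isinstance(gender_arg, str):
        return False
    # Assumed spellings; the real set of accepted values may differ.
    return gender_arg.strip().lower() in ("m", "male")


for arg in ("Male", "M", "female", "unknown", None):
    print(repr(arg), "->", "male" if is_male_reference(arg) else "female")
```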

Version 0.7.8

04 Mar 00:42

New features in the call command make it more amenable to analyzing tumor heterogeneity, and also make the rescale command redundant. Documentation is updated with more methodological background info.

call:

  • Put absolute copy number in a new "cn" column. When rescaling log2 ratios for purity, do not round to integer absolute copy number values. (#83)
  • New -v/--vcf option: Calculate b-allele frequency (BAF) average for each segment and output as a new column "baf". Rescale BAFs if --purity is specified. Then, using BAF and total copy number (CN, the "cn" column), assign major and minor allele copy number to each segment and output as new columns "cn1" and "cn2". These values can indicate allelic imbalance, including loss of heterozygosity (LOH). (#84)
  • New --center option that works the same as in rescale.
  • New method -m none performs any specified transformations (rescaling, re-centering, adding b-allele frequencies), but does not call integer copy numbers.
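
To illustrate how a segment's BAF and total copy number can be split into major and minor allele copy numbers, here is a minimal sketch. The rounding rule and the function name are assumptions; the release note does not state how call computes "cn1" and "cn2".

```python
import math


def allelic_copy_numbers(cn, baf):
    """Split total copy number into (major, minor) allele copy numbers.

    Assumption: the major allele takes the share max(baf, 1 - baf) of the
    total, rounded half-up to the nearest integer.
    """
    major_fraction = max(baf, 1.0 - baf)  # mirror BAF to 0.5 or above
    cn1 = int(math.floor(major_fraction * cn + 0.5))
    cn2 = cn - cn1
    return cn1, cn2


print(allelic_copy_numbers(2, 0.5))   # balanced diploid segment
print(allelic_copy_numbers(2, 0.95))  # copy-neutral LOH candidate
print(allelic_copy_numbers(3, 0.33))  # single-copy gain
```

A minor allele copy number of 0 alongside a total of 2 or more is the pattern that suggests LOH.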

rescale:

  • Deprecated in favor of call with the -m none option, which does the same thing.
  • If recentering is specified with --center, do it before, not after, rescaling log2 values for tumor sample purity.
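
The order of operations can be sketched like this. The mixture model below (contaminating normal cells contribute a neutral copy ratio of 1) is a common simplification and an assumption here, as are the function names and the floor value; it is not taken from the CNVkit source.

```python
import math
import statistics


def recenter(log2_values):
    """Shift log2 ratios so that their median is 0."""
    center = statistics.median(log2_values)
    return [v - center for v in log2_values]


def rescale_for_purity(log2_value, purity, floor=1e-3):
    """Estimate the tumor-only log2 ratio from a tumor/normal mixture.

    Assumes observed_ratio = purity * tumor_ratio + (1 - purity) * 1.0.
    The floor keeps the logarithm defined if the estimate is not positive.
    """
    observed = 2.0 ** log2_value
    tumor = (observed - (1.0 - purity)) / purity
    return math.log2(max(tumor, floor))


def recenter_then_rescale(log2_values, purity):
    """Re-center first, then rescale each value for purity."""
    return [rescale_for_purity(v, purity) for v in recenter(log2_values)]


# A single-copy loss in a 50%-pure sample has an observed ratio of 0.75.
print(rescale_for_purity(math.log2(0.75), purity=0.5))
```

Re-centering first means the neutral baseline is at log2 = 0 before the nonlinear purity correction is applied, so the correction is not computed relative to a shifted baseline.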

export bed, export vcf:

  • Take absolute copy number from "cn" column if present (#83)

antitarget:

  • Whitelist chromosomes X and Y along with integer chromosome names for inclusion as canonical mammalian chromosomes. Keep the fallback to "short" chromosome names if no such canonical chromosome names are detected. (#37)

reference:

  • Expose bias corrections (GC, RepeatMasker, targeting density) as command-line options --no-gc, --no-rmask, and --no-edge, similar to the fix command. (#80)

Internal:

  • VariantArray.read_vcf: somatic mask was the opposite of what it should have been, i.e. skip_somatic was skipping germline and retaining only somatic SNVs.

Version 0.7.7

25 Feb 02:19

Small improvements, bugfixes, and documentation updates.

fix:

  • Removed the hard filter on RepeatMasker fraction of antitarget bins. This filter doesn't appear to improve calling on current benchmarks.
  • Drop bins that have very high coverage in the reference, in addition to the low-coverage bins already dropped (normalized log2 values outside +/- 5).
  • Ignore very-low-coverage bins when recentering (by default). For good-quality samples this doesn't make much difference, but it's safer and seems to improve the centering slightly on lower-quality samples.
  • Ensure antitarget bin weights are not set to 0 if the majority of target bins have no coverage -- this would cause segmentation to fail. (#82)
  • Don't crash if antitargets are empty (to support WGS and targeted amplicon capture), fixing a regression.

antitarget:

  • Keep untargeted contigs that appear to be "canonical" chromosomes. Prefer chromosomes with numeric names (autosomes in most mammalian reference genomes); but if none of the targeted chromosomes have numeric names, then fall back to chromosomes with names no longer than the longest-named targeted chromosome. (#37)
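
The selection rule can be sketched as below. The function name, the handling of a "chr" prefix, and the example contig names are assumptions for illustration. As of this release only numeric names count as canonical; X and Y were whitelisted in 0.7.8.

```python
def select_untargeted_chromosomes(targeted, all_contigs):
    """Pick untargeted contigs that look like canonical chromosomes."""

    def is_numeric(name):
        # Assumption: an optional "chr" prefix is ignored when testing the name.
        stripped = name[3:] if name.lower().startswith("chr") else name
        return stripped.isdigit()

    if not targeted:
        return []
    targeted_set = set(targeted)
    untargeted = [c for c in all_contigs if c not in targeted_set]
    if any(is_numeric(c) for c in targeted):
        # Preferred rule: keep contigs with numeric names.
        return [c for c in untargeted if is_numeric(c)]
    # Fallback: keep contigs whose names are no longer than the longest
    # targeted chromosome name.
    max_len = max(len(c) for c in targeted)
    return [c for c in untargeted if len(c) <= max_len]


contigs = ["chr1", "chr2", "chr3", "chrX", "chr6_apd_hap1"]
print(select_untargeted_chromosomes(["chr1", "chr2"], contigs))
```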

batch:

  • Disallow input BAMs with duplicate base filenames (#81). Now it will trigger an error instead of overwriting some output files.
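
A check of this kind can be sketched as follows; the function name and paths are hypothetical, and this is not the code used by batch.

```python
import os
from collections import Counter


def find_duplicate_basenames(bam_paths):
    """Return base filenames (without extension) that occur more than once."""
    names = [os.path.splitext(os.path.basename(p))[0] for p in bam_paths]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


paths = ["run1/sampleA.bam", "run2/sampleA.bam", "run1/sampleB.bam"]
duplicates = find_duplicate_basenames(paths)
if duplicates:
    print("Duplicate base filenames:", ", ".join(duplicates))
```

Output files are named after the base filename, which is why two inputs with the same base name in different directories would otherwise overwrite each other.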

segment:

  • The --drop-outliers option now masks outliers according to multiples (default 10x) of the 95th percentile, not the 90th. Benchmarking looks better.
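
A simplified sketch of percentile-based outlier masking is shown below. It assumes deviations are measured from the global median of the bin log2 values, which may differ from what segment does internally; the function name is hypothetical.

```python
import statistics


def mask_outliers(log2_values, factor=10.0):
    """Return a list of booleans, True where a bin looks like an outlier.

    A bin is flagged when its absolute deviation from the median exceeds
    `factor` times the 95th percentile of all absolute deviations.
    """
    center = statistics.median(log2_values)
    deviations = [abs(v - center) for v in log2_values]
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = statistics.quantiles(deviations, n=20)[-1]
    threshold = factor * p95
    return [d > threshold for d in deviations]


values = [0.0] * 50 + [0.1] * 49 + [5.0]
mask = mask_outliers(values)
print(sum(mask), "outlier bin(s) flagged")
```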

Plots scatter, heatmap:

  • With the "-c/--chromosome" option, handle unbounded ranges (e.g. "chr1:100-" or "chr5:-100000") treating the missing start/end of the range as the start/end of the specified chromosome.
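
Parsing such a range can be sketched as follows. The function name, the chrom_length argument, and the use of 0 as the chromosome start are assumptions for illustration.

```python
def parse_range(text, chrom_length):
    """Parse 'chrom', 'chrom:start-end', 'chrom:start-' or 'chrom:-end'.

    A missing start defaults to 0 and a missing end defaults to the
    chromosome length.
    """
    chrom, _, span = text.partition(":")
    if not span:
        return chrom, 0, chrom_length
    start_text, _, end_text = span.partition("-")
    start = int(start_text) if start_text else 0
    end = int(end_text) if end_text else chrom_length
    return chrom, start, end


print(parse_range("chr1:100-", chrom_length=249250621))
print(parse_range("chr5:-100000", chrom_length=180915260))
```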

heatmap:

  • A more efficient implementation. Now, plotting a heatmap of .cnr is feasible, and behavior is a bit more consistent (e.g. placement of rectangles is more accurate; plotting a selection where only some samples have data will still show all samples).
  • Don't crash if selection overlaps no segments, e.g. if the selection is a centromeric or telomeric region. Previously it would crash with an obscure error.

Misc. bugfixes:

  • batch: log # parallel processes correctly for "-p 0"
  • import-theta: fix crash; namedtuples are immutable (#77)
  • metrics: require --segments (closes #79)
  • rescale: fix crash if --purity is not specified
  • VariantArray: Fix VCF parsing if filters are not used.

Version 0.7.6

03 Feb 20:49

Minor bugfixes and improvements.

scatter:

  • Tweaked plot colors for better visibility and accessibility: points are slightly darker, and segments are now a deep gold color instead of red.

fix:

  • Downweight targets or antitargets proportionally to their relative variability of bin log2 values; i.e. if targets are twice as variable (by interquartile range of bin log2 values) as antitargets, divide all target bin weights by 2. This happens after all bias corrections and reference normalization, and appears to improve the final segmentation results.
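
The downweighting rule can be sketched as below. The function names are hypothetical, and the quartile method is whatever Python's statistics.quantiles uses by default, which may not match the one used by fix.

```python
import statistics


def iqr(values):
    """Interquartile range of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def downweight_noisier(target_log2, target_weights, anti_log2, anti_weights):
    """Divide the weights of the more variable bin type by the IQR ratio."""
    t_iqr, a_iqr = iqr(target_log2), iqr(anti_log2)
    if t_iqr > a_iqr > 0:
        factor = t_iqr / a_iqr
        target_weights = [w / factor for w in target_weights]
    elif a_iqr > t_iqr > 0:
        factor = a_iqr / t_iqr
        anti_weights = [w / factor for w in anti_weights]
    return target_weights, anti_weights


anti = [-1.0, -0.5, 0.0, 0.5, 1.0]
target = [2 * v for v in anti]  # twice as variable as the antitargets
new_t, new_a = downweight_noisier(target, [1.0, 0.8], anti, [1.0, 0.6])
print(new_t, new_a)
```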

antitarget:

  • Don't emit antitargets for untargeted chromosomes with long names, e.g. "chr6_apd_hap1" -- these are presumably alternative/unassigned contigs, not real canonical chromosomes that deserve to be included for CNV calling. But do continue to keep untargeted chromosomes with names up to the length of the longest-named targeted chromosome. (Improves on #37)
  • Indicate default --min-size in the help message.

batch:

  • Log the number of parallel processes correctly when "-p 0" is used to automatically detect the number of CPUs -- previously, this option would print on the console that samples were being run in serial, but then launch multiple parallel processes.

segment:

  • Change the --drop-outliers default value from 5 to 10, based on performance in benchmarking.

Internally:

  • Fixed detection of autosomes to be used for re-centering bin log2 values and detecting gender.
  • Fixed parsing the GATK/Picard "interval list" file format - strand and name were swapped.

Version 0.7.5

16 Jan 00:21

Global speedups, friendlier error handling and miscellaneous bug fixes.
Documentation updates (thanks @kyleabeauchamp; #67).
Expanded unit tests & restored continuous integration (TravisCI).
Raised the minimum pandas version to 0.17.1, the latest.

rescale (new command; #64):

  • Adjust .cnr or .cns files for normal contamination or subclone fraction.
  • Re-center log2 values by median (the usual), mode, mean, or biweight location.

segment:

  • Detect outlier bins and ignore them during segmentation using a method similar to BIC-seq. Command line option: --drop-outliers; any outlier bins found will be logged.

coverage:

  • If the given target BED file is missing the 4th column (gene names), fill in the dummy name "-" instead of crashing.

segmetrics:

  • Expose alpha and number of bootstraps as command-line options -a/--alpha and -b/--bootstrap for calculating confidence intervals.

antitarget:

  • Reduce default bin size from 150kb to 100kb.

fix:

  • Speed improvements: now about 20 times faster on exomes.

API changes:

  • Gene names to treat as meaningless and to ignore in reporting (by default "-", ".", "CGH") can be globally configured in cnvlib/params.py (params.IGNORE_GENE_NAMES).
  • vary.VariantArray (used in scatter) can now parse VCF files with no samples (genotypes) as a table of plain loci.

Version 0.7.4

10 Dec 18:02

This is primarily a bugfix release.

export:

  • bed --show variant now filters CNAs on sex chromosomes correctly, taking reference and sample genders into account.
  • nexus-ogt format now emits BAFs more similar to the original VCF allele frequencies. Previously, if multiple SNVs fell into a single CNVkit genomic bin, the allele frequencies of those SNVs would all be "mirrored" above 0.5 before taking the median. Now the SNVs are mirrored in the direction of the majority of the SNVs in the bin, whether above or below 0.5, so that the output looks more balanced and low-frequency SNVs are more apparent.
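
The majority-direction mirroring can be sketched as follows. The function name and the tie-breaking choice (mirror upward when the counts are equal) are assumptions; the release note does not specify how ties are handled.

```python
import statistics


def bin_baf(allele_freqs):
    """Summarize the SNV allele frequencies that fall into one bin.

    Frequencies are mirrored around 0.5 toward the side where most SNVs
    already lie, then the median is taken.
    """
    above = sum(1 for f in allele_freqs if f > 0.5)
    below = sum(1 for f in allele_freqs if f < 0.5)
    if above >= below:  # assumed tie-break: mirror upward
        mirrored = [max(f, 1.0 - f) for f in allele_freqs]
    else:
        mirrored = [min(f, 1.0 - f) for f in allele_freqs]
    return statistics.median(mirrored)


print(bin_baf([0.30, 0.35, 0.70]))  # majority below 0.5, result stays low
print(bin_baf([0.60, 0.65, 0.30]))  # majority above 0.5, result stays high
```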

heatmap:

  • Sub-chromosomal regions can now be selected for display with the -c option, e.g. -c chr7:125000000-145000000, just like the same option in scatter.

segment:

  • Fix the listing of gene names in each segment in the output .cns file. For a short time in previous releases, each gene's name was truncated to 1 character.

Version 0.7.3

11 Nov 06:31

access:

  • New command equivalent to the now-deprecated genome2access.py script.

target, antitarget:

  • Always write output files in 4-column BED format.

scatter:

  • Copy ratios (.cnr) are no longer required. Without this input file, behavior is similar to the now-deprecated loh command, but still more flexible.
  • VCF input file can include multiple tumor samples and PEDIGREE tags; if a tumor sample ID is specified, all PEDIGREE tags will be checked to find the matching normal sample.
  • VCFs processed by CLC Genomics Server are now parsed correctly.

loh:

  • Deprecated. Use scatter with -v and no .cnr file instead.

segment:

  • Preliminary support for segmenting SNP allele frequencies from a VCF in addition to total copy number (-v option). Details are likely to change in a later release. (#34)
  • In the weight column of the output file, values are now the sum, not the mean, of the weights of the probes covered by that segment.
  • The haar segmentation method is improved to avoid duplicate breakpoints and run much faster.

export bed:

  • Deprecate --show-all in favor of --show with possible arguments all (like --show-all), ploidy (default behavior), or variant (show the same regions as export vcf).

export vcf:

  • Fix a typo in the SVLEN tag definition in the VCF header -- Number should be 1, not -1 which caused GATK parsing to fail. (#57; thanks @chapmanb)

Python library cnvlib:

  • Logging is now done with the Python standard library's logging module, making it easier to silence or redirect status messages. In particular, unit tests run more quietly. (#52)
  • Internal refactoring (including new features in GenomicArray, RegionArray, VariantArray) resulting in changes to the cnvlib API, as well as some performance improvements.

Version 0.7.2

09 Oct 20:58

A variety of mostly minor improvements and bug fixes over v0.7.1.

segment, gainloss, segmetrics:

  • Don't exclude very-low-coverage bins from calculations by default; instead,
    expose this option as --drop-low-coverage. (This option usually helps on
    tumor samples with some normal contamination, but leads to problems on
    germline samples with homozygous deletions.)

segment:

  • Output .cns files now have a "weight" column which is the mean of the weights
    of the bins it covers.
  • Output of the 'haar' segmentation method now has each segment's gene names
    listed, as with the other methods.
  • Fixed a bug where every segment's probe count (the "probes" column) could be
    overwritten with the _ character. (#53; thanks @chapmanb)

segmetrics:

  • Each statistic is now printed in its own column, instead of squeezing all
    stats into the "gene" column. The confidence/prediction interval stats get
    two columns, _lo and _hi (lower and upper bound).

loh, scatter:

  • Given a VCF called on a tumor-normal pair, use the paired normal to select
    appropriate germline SNPs for plotting.

export:

  • New format "nexus-ogt" combines bin-level copy number ratios with b-allele
    frequencies given a VCF and a .cnr file. This replaces "nexus-basic" with the
    -v option that was introduced in v0.7.1; "nexus-ogt" stores the same info
    but can be viewed in BioDiscovery Nexus Copy Number without any special
    configuration (load it as the "Custom-OGT" data format).
  • Renamed bed option --show-neutral to --show-all.
  • vcf option -g/--gender now works properly for identifying CNVs on sex
    chromosomes.

call:

  • Fixed the threshold method to calculate absolute copy number on sex
    chromosomes correctly. (#49; thanks @tskir)

Version 0.7.1

30 Sep 18:25

This is primarily a bugfix release. Many more unit test cases were added to the automated test suite. Code coverage is now monitored at Codecov (thanks @stevepeak).

export nexus-basic:

  • New optional argument -v/--vcf extracts SNV b-allele frequencies from the given VCF file, matches them to the bins in the .cnr file, and prints an additional "baf" column in the output table. These allele frequencies can then be viewed in Nexus Copy Number, similar to a SNP array.

call:

  • Fixed a bug in the threshold method where the copy number of haploid chromosomes was twice what it should be. The clonal method already handled these chromosomes properly. (#49)

reference:

  • Handle blank/empty antitarget BED and coverage (.cnn) files. This was a regression in v0.7.0 relative to earlier releases. (#51)
  • When calculating GC and RepeatMasker values, catch invalid BED ranges that extend beyond the length of the chromosome and raise an informative error. This would error before, too (in ngfrills.faidx), but the message would be baffling.

fix:

  • Catch duplicated target ranges, e.g. the exact same bait labeled with two different gene names, and report those ranges in the error message. The target command's --split option should usually fix these, but sometimes it's not used.
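
Detecting duplicated ranges can be sketched as below. The function name and the row layout (chromosome, start, end, gene name) are assumptions for illustration, and this is not the code used by fix.

```python
from collections import defaultdict


def find_duplicate_ranges(bed_rows):
    """Map each (chrom, start, end) seen more than once to its gene names."""
    names_by_range = defaultdict(list)
    for chrom, start, end, name in bed_rows:
        names_by_range[(chrom, start, end)].append(name)
    return {rng: names for rng, names in names_by_range.items() if len(names) > 1}


rows = [
    ("chr1", 100, 200, "GENE1"),
    ("chr1", 100, 200, "GENE1-AS"),  # same bait, different label
    ("chr1", 300, 400, "GENE2"),
]
for (chrom, start, end), names in find_duplicate_ranges(rows).items():
    print("Duplicated range %s:%d-%d labeled %s" % (chrom, start, end, ", ".join(names)))
```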