doc: expanded the misc. guidance in quickstart, tumor, germline
etal committed Sep 13, 2016
1 parent 7e2afe6 commit d63dfa9
Showing 4 changed files with 93 additions and 24 deletions.
35 changes: 27 additions & 8 deletions doc/germline.rst
@@ -1,14 +1,33 @@
Germline analysis
=================

.. TODO - see e-mails, biostars, notes
CNVkit can be used with exome sequencing of constitutional (non-tumor) samples,
for example to detect germline copy number alterations associated with heritable
conditions. However, note that CNVkit is less accurate in detecting CNVs
smaller than 1 Mbp, typically only detecting variants that span multiple exons
or captured regions. When used on exome or target panel datasets, CNVkit will
not detect the small CNVs that are more common in populations.

CNVkit is less accurate in detecting CNVs smaller than 1 Mbp.
To use CNVkit to detect medium-to-large CNVs or unbalanced SVs in constitutional
samples:

The ``--drop-low-coverage`` option (see :doc:`tumor`) should not be used; it
will typically remove germline deep deletions altogether, which is not
desirable.
- The :ref:`call` command can be used directly without specifying the
``--purity`` and ``--ploidy`` values, as the defaults will be correct for
mammalian cells. (For non-diploid species, use the correct ``--ploidy``, of
course.) The default ``--method threshold`` assigns integer copy number
similarly to ``--method clonal``, but with smaller thresholds for calling
single-copy changes. The default thresholds allow for mosaicism in CNVs, which
yields log2 values of smaller magnitude than a single-copy CNV present in every
cell would. (Mosaic CNVs are more common than often thought.)

Watch for mosaicism in CNVs, resulting in non-integer copy numbers (i.e. smaller
log2 value than a single-copy CNV would indicate); they're more common than
often thought.
- The ``--filter`` option in :ref:`call` can be used to reduce the number of
false-positive segments returned. To use the ``ci`` (recommended) or ``sem``
filters, first run each sample's segmented .cns file through :ref:`segmetrics`
with the ``--ci`` option, which adds upper and lower confidence limits to the
.cns output that ``call --filter ci`` can then use.

- The ``--drop-low-coverage`` option (see :doc:`tumor`) should not be used; it
will typically remove germline deep deletions altogether, which is not
desirable.

- For using CNVkit with whole-genome sequencing datasets, see :doc:`nonhybrid`.
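
To make the point about mosaicism concrete, here is a minimal Python sketch of
the log2 ratio expected when only a fraction of cells carry a copy number
change in an otherwise diploid genome. The helper name ``expected_log2`` and
the simple two-population mixture model are assumptions made for illustration;
this is not CNVkit's own code.

```python
import math


def expected_log2(cell_fraction, altered_copies, normal_copies=2):
    """Expected log2 ratio for a mixture of altered and unaltered cells.

    Hypothetical helper for illustration only. Assumes a two-population
    mixture with no technical bias: `cell_fraction` of the cells carry
    `altered_copies`, the rest carry `normal_copies`.
    """
    if not 0.0 <= cell_fraction <= 1.0:
        raise ValueError("cell_fraction must be between 0 and 1")
    mean_copies = (cell_fraction * altered_copies
                   + (1.0 - cell_fraction) * normal_copies)
    if mean_copies <= 0:
        return float("-inf")
    return math.log2(mean_copies / normal_copies)


# Single-copy loss and gain present in every cell
full_loss = expected_log2(1.0, 1)    # log2(1/2) = -1.0
full_gain = expected_log2(1.0, 3)    # log2(3/2), about 0.585

# The same events present in only half of the cells (mosaic)
mosaic_loss = expected_log2(0.5, 1)  # log2(1.5/2), about -0.415
mosaic_gain = expected_log2(0.5, 3)  # log2(2.5/2), about 0.322
```

Under these assumptions a mosaic event sits closer to zero than a
constitutional one, which is why thresholds tuned for full single-copy changes
could miss it.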
2 changes: 1 addition & 1 deletion doc/index.rst
@@ -42,11 +42,11 @@ FAQ & How To
:maxdepth: 2

calling
gender
tumor
heterogeneity
germline
nonhybrid
gender
bias
fileformats

29 changes: 27 additions & 2 deletions doc/quickstart.rst
@@ -145,8 +145,8 @@ If your targets are missing gene names, you can add them here with the
target BED file.


Process more tumor samples
--------------------------
Next steps
----------

You can reuse the reference file you've previously constructed to extract copy
number information from additional tumor sample BAM files, without repeating the
@@ -161,6 +161,31 @@ The coordinates of the target and antitarget bins, the gene names for the
targets, and the GC and RepeatMasker information for bias corrections are
automatically extracted from the reference .cnn file you've built.

Now, starting a project from scratch, you could follow any of these approaches:

- Run ``batch`` as above with all tumor/test and normal/control samples
specified as they are, and hope for the best. (This should usually work fine.)
- *For the careful:* Run ``batch`` with just the normal samples specified as
normal, yielding coverage .cnn files and a **pooled reference**. Inspect the
coverages of all samples with the :ref:`metrics` command, eliminating any
poor-quality samples and choosing a larger or smaller antitarget bin size if
necessary. Build an updated pooled reference using :ref:`batch` or
:ref:`coverage` and :ref:`reference` (see :doc:`pipeline`), coordinating your
work in a `Makefile <https://en.wikipedia.org/wiki/Makefile>`_, Rakefile, or
similar build tool.

- See also: `Ten Simple Rules for Reproducible Computational Research
<http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285>`_

- *For the power user:* Run ``batch`` with all samples specified as tumor
samples, using ``-n`` by itself to build a **flat reference**, yielding
coverages, copy ratios, segments and optionally plots for all samples, both
tumor and normal. Inspect the "rough draft" outputs and determine an
appropriate strategy to build and use a **pooled reference** to re-analyze the
samples -- ideally coordinated with a build tool as above.
- Use a framework like `bcbio-nextgen <https://bcbio-nextgen.readthedocs.io/>`_
to coordinate the complete sequencing data analysis pipeline.
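
The approaches above differ mainly in how ``batch`` is invoked. As a rough
sketch, the Python helpers below assemble the corresponding command lines
without running them. The helper names and all file names are hypothetical,
and only the long-form options ``--normal``, ``--targets``, ``--fasta``,
``--output-reference``, ``--output-dir`` and ``--reference`` are used. Check
``cnvkit.py batch -h`` for any further options your installed version needs.

```python
import shlex


def pooled_reference_cmd(tumor_bams, normal_bams, targets_bed, fasta, out_dir):
    """Command for building a pooled reference from the normal samples."""
    return (["cnvkit.py", "batch"] + list(tumor_bams)
            + ["--normal"] + list(normal_bams)
            + ["--targets", targets_bed,
               "--fasta", fasta,
               "--output-reference", out_dir + "/pooled_reference.cnn",
               "--output-dir", out_dir])


def flat_reference_cmd(all_bams, targets_bed, fasta, out_dir):
    """Command for a first pass: every sample is a test sample, and
    --normal is given with no files, which requests a flat reference."""
    return (["cnvkit.py", "batch"] + list(all_bams)
            + ["--normal",
               "--targets", targets_bed,
               "--fasta", fasta,
               "--output-reference", out_dir + "/flat_reference.cnn",
               "--output-dir", out_dir])


def reuse_reference_cmd(new_bams, reference_cnn, out_dir):
    """Command for processing more samples against an existing reference."""
    return (["cnvkit.py", "batch"] + list(new_bams)
            + ["--reference", reference_cnn, "--output-dir", out_dir])


# Hypothetical file names, for illustration only
flat = flat_reference_cmd(["T1.bam", "N1.bam"], "targets.bed",
                          "genome.fa", "draft")
reuse = reuse_reference_cmd(["T2.bam"], "results/pooled_reference.cnn",
                            "results")
flat_line = " ".join(shlex.quote(arg) for arg in flat)
```

Each list could be passed to ``subprocess.run`` or, in line with the build-tool
advice above, written out as a rule in a Makefile.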

See the command-line usage pages for additional
:doc:`visualization <plots>`,
:doc:`reporting <reports>` and
51 changes: 38 additions & 13 deletions doc/tumor.rst
@@ -1,17 +1,42 @@
Tumor analysis
==============

Solid tumor samples: Use ``--drop-low-coverage`` in the :ref:`batch` and
:ref:`segment` commands. Virtually all tumor samples, even cancer cell lines,
are not completely homogeneous. Even in regions of homozygous deletion in the
largest tumor-cell clonal population, some sequencing reads will be obtained
from contaminating normal cells without the deletion.
Therefore, extremely low log2 copy ratio values (below -15) do not indicate
homozygous deletions but failed sequencing or mapping in all cells regardless
of copy number status at that site, which are not informative for copy number.
This option in the :ref:`batch` command applies to segmentation; the option is
also available in the :ref:`segment`, :ref:`metrics`, :ref:`segmetrics`,
:ref:`gainloss` and :doc:`heterogeneity` commands.
CNVkit has been used most extensively on solid tumor samples sequenced with a
target panel or whole-exome sequencing protocol. Several options and approaches
are available to support this use case:

If you have unpaired tumor samples, or no normal samples sequenced on the
same platform, see the :ref:`reference` command for strategies.
- If you have unpaired tumor samples, or no normal samples sequenced on the same
platform, see the :ref:`reference` command for strategies.

- Use ``--drop-low-coverage`` to ignore bins with log2 normalized coverage
values below -15. Virtually all tumor samples, even cancer cell lines, are
not completely homogeneous. Even in regions of homozygous deletion in the
largest tumor-cell clonal population, some sequencing reads will be obtained
from contaminating normal cells without the deletion. Therefore, extremely low
log2 copy ratio values do not indicate homozygous deletions; they indicate
failed sequencing or mapping in all cells, regardless of copy number status at
that site, and are not informative for copy number. This option in the
:ref:`batch` command applies to segmentation; the option is also available in
the :ref:`segment`, :ref:`metrics`, :ref:`segmetrics`, :ref:`gainloss` and
:doc:`heterogeneity` commands.

- Why -15? The null log2 value substituted for bins with zero coverage is
-20 (about 1 millionth the average bin's coverage), and the maximum
positive shift that can be introduced by normalizing to the reference is 5
(for bins with 1/32 the average coverage; bins below this are masked out
by the reference). In a .cnr file, any bins with log2 value below -15 are
probably based on dummy values corresponding to zero-coverage (perhaps
unmappable) bins, and not real observations.

- The :ref:`batch` command does not directly output integer copy number calls
(see :doc:`heterogeneity`). Instead, use the ``--ploidy`` and ``--purity``
options in :ref:`call` to calculate copy number for each sample individually
using known or estimated tumor-cell fractions. Also consider using ``--center
median`` in highly aneuploid samples to shift the log2 value of true neutral
regions closer to zero, as it may be slightly off initially.

- If SNV calls are available in VCF format, use the ``-v``/``--vcf`` option in
the :ref:`call` and :ref:`scatter` commands to calculate or plot b-allele
frequencies alongside each segment's total copy number or log2 ratio. These
values reveal allelic imbalance and loss of heterozygosity (LOH), supporting
and extending the inferred CNVs.
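
The arithmetic behind the -15 cutoff described above can be checked with a few
lines of Python. The constants are taken from the text; the variable names and
the ``is_dummy_bin`` helper are assumptions for illustration and are not part
of CNVkit.

```python
import math

# Values stated in the text above
NULL_LOG2 = -20.0          # log2 value substituted for zero-coverage bins
MIN_REF_FRACTION = 1 / 32  # lowest reference coverage, relative to the
                           # average bin, that the reference does not mask

# Coverage fraction implied by the null value: about one millionth
null_fraction = 2.0 ** NULL_LOG2

# Normalizing subtracts the reference log2 value, so a reference bin at
# 1/32 of average coverage shifts the sample's log2 value upward by 5.
max_positive_shift = -math.log2(MIN_REF_FRACTION)

# Highest log2 value a zero-coverage bin can reach after normalization
cutoff = NULL_LOG2 + max_positive_shift


def is_dummy_bin(log2_value, threshold=cutoff):
    """Hypothetical check: True if the value is below the cutoff and so
    probably reflects a zero-coverage placeholder, not a real observation."""
    return log2_value < threshold
```

With these numbers ``cutoff`` is -15, so a bin at -17.3 would be flagged,
while a deep but real deletion at -4 would not.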
