From d63dfa94e9cfb1600ee099b8f11d8d94b87c4d43 Mon Sep 17 00:00:00 2001
From: Eric Talevich <eric.talevich@gmail.com>
Date: Tue, 13 Sep 2016 15:22:14 -0700
Subject: [PATCH] doc: expanded the misc. guidance in quickstart, tumor,
 germline

---
 doc/germline.rst   | 35 +++++++++++++++++++++++--------
 doc/index.rst      |  2 +-
 doc/quickstart.rst | 29 ++++++++++++++++++++++++--
 doc/tumor.rst      | 51 ++++++++++++++++++++++++++++++++++------------
 4 files changed, 93 insertions(+), 24 deletions(-)

diff --git a/doc/germline.rst b/doc/germline.rst
index 09d11de9..c4094a0f 100644
--- a/doc/germline.rst
+++ b/doc/germline.rst
@@ -1,14 +1,33 @@
 Germline analysis
 =================
 
-.. TODO - see e-mails, biostars, notes
+CNVkit can be used with exome sequencing of constitutional (non-tumor) samples,
+for example to detect germline copy number alterations associated with heritable
+conditions. However, note that CNVkit is less accurate in detecting CNVs
+smaller than 1 Mbp, typically only detecting variants that span multiple exons
+or captured regions.  When used on exome or target panel datasets, CNVkit will
+not detect the small CNVs that are more common in populations.
 
-CNVkit is less accurate in detecting CNVs smaller than 1 Mbp.
+To use CNVkit to detect medium-to-large CNVs or unbalanced SVs in constitutional
+samples:
 
-The ``--drop-low-coverage`` option (see :doc:`tumor`) should not be used; it
-will typically remove germline deep deletions altogether, which is not
-desirable.
+- The :ref:`call` command can be used directly without specifying the
+  ``--purity`` and ``--ploidy`` values, as the defaults will be correct for
+  mammalian cells. (For non-diploid species, use the correct ``--ploidy``, of
+  course.) The default ``--method threshold`` assigns integer copy number
+  similarly to ``--method clonal``, but with smaller thresholds for calling
+  single-copy changes. The default thresholds allow for mosaicism in CNVs,
+  which yields log2 values smaller in magnitude than a single-copy CNV would
+  indicate. (Mosaic CNVs are more common than often thought.)
 
-Watch for mosaicism in CNVs, resulting in non-integer copy numbers (i.e. smaller
-log2 value than a single-copy CNV would indicate); they're more common than
-often thought.
+- The ``--filter`` option in :ref:`call` can be used to reduce the number of
+  false-positive segments returned. To use the ``ci`` (recommended) or ``sem``
+  filters, first run each sample's segmented .cns file through :ref:`segmetrics`
+  with the ``--ci`` option, which adds upper and lower confidence limits to the
+  .cns output that ``call --filter ci`` can then use.
+
+- The ``--drop-low-coverage`` option (see :doc:`tumor`) should not be used; it
+  will typically remove germline deep deletions altogether, which is not
+  desirable.
+
+- For using CNVkit with whole-genome sequencing datasets, see :doc:`nonhybrid`.
diff --git a/doc/index.rst b/doc/index.rst
index 2af59dcc..ab502a3e 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -42,11 +42,11 @@ FAQ & How To
     :maxdepth: 2
 
     calling
-    gender
     tumor
     heterogeneity
     germline
     nonhybrid
+    gender
     bias
     fileformats
 
diff --git a/doc/quickstart.rst b/doc/quickstart.rst
index 5b37d7ed..fa8faca1 100644
--- a/doc/quickstart.rst
+++ b/doc/quickstart.rst
@@ -145,8 +145,8 @@ If your targets are missing gene names, you can add them here with the
       target BED file.
 
 
-Process more tumor samples
---------------------------
+Next steps
+----------
 
 You can reuse the reference file you've previously constructed to extract copy
 number information from additional tumor sample BAM files, without repeating the
@@ -161,6 +161,31 @@ The coordinates of the target and antitarget bins, the gene names for the
 targets, and the GC and RepeatMasker information for bias corrections are
 automatically extracted from the reference .cnn file you've built.
 
+Now, starting a project from scratch, you could follow any of these approaches:
+
+- Run ``batch`` as above with all tumor/test and normal/control samples
+  specified as they are, and hope for the best. (This should usually work fine.)
+- *For the careful:* Run ``batch`` with just the normal samples specified as
+  normal, yielding coverage .cnn files and a **pooled reference**. Inspect the
+  coverages of all samples with the :ref:`metrics` command, eliminating any
+  poor-quality samples and choosing a larger or smaller antitarget bin size if
+  necessary. Build an updated pooled reference using :ref:`batch` or
+  :ref:`coverage` and :ref:`reference` (see :doc:`pipeline`), coordinating your
+  work in a `Makefile <https://en.wikipedia.org/wiki/Makefile>`_, Rakefile, or
+  similar build tool.
+
+    - See also: `Ten Simple Rules for Reproducible Computational Research
+      <http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285>`_
+
+- *For the power user:* Run ``batch`` with all samples specified as tumor
+  samples, using ``-n`` by itself to build a **flat reference**, yielding
+  coverages, copy ratios, segments and optionally plots for all samples, both
+  tumor and normal. Inspect the "rough draft" outputs and determine an
+  appropriate strategy to build and use a **pooled reference** to re-analyze the
+  samples -- ideally coordinated with a build tool as above.
+- Use a framework like `bcbio-nextgen <https://bcbio-nextgen.readthedocs.io/>`_
+  to coordinate the complete sequencing data analysis pipeline.
+
 See the command-line usage pages for additional
 :doc:`visualization <plots>`,
 :doc:`reporting <reports>` and
diff --git a/doc/tumor.rst b/doc/tumor.rst
index c6e5024d..a69f8915 100644
--- a/doc/tumor.rst
+++ b/doc/tumor.rst
@@ -1,17 +1,42 @@
 Tumor analysis
 ==============
 
-Solid tumor samples: Use ``--drop-low-coverage`` in the :ref:`batch` and
-:ref:`segment` commands. Virtually all tumor samples, even cancer cell lines,
-are not completely homogeneous. Even in regions of homozygous deletion in the
-largest tumor-cell clonal population, some sequencing reads will be obtained
-from contaminating normal cells without the deletion.
-Therefore, extremely low log2 copy ratio values (below -15) do not indicate
-homozygous deletions but failed sequencing or mapping in all cells regardless
-of copy number status at that site, which are not informative for copy number.
-This option in the :ref:`batch` command applies to segmentation; the option is
-also available in the :ref:`segment`, :ref:`metrics`, :ref:`segmetrics`,
-:ref:`gainloss` and :doc:`heterogeneity` commands.
+CNVkit has been used most extensively on solid tumor samples sequenced with a
+target panel or whole-exome sequencing protocol. Several options and approaches
+are available to support this use case:
 
-If you have unpaired tumor samples, or no normal samples sequenced on the
-same platform, see the :ref:`reference` command for strategies.
+- If you have unpaired tumor samples, or no normal samples sequenced on the same
+  platform, see the :ref:`reference` command for strategies.
+
+- Use ``--drop-low-coverage`` to ignore bins with log2 normalized coverage
+  values below -15.  Virtually all tumor samples, even cancer cell lines, are
+  not completely homogeneous. Even in regions of homozygous deletion in the
+  largest tumor-cell clonal population, some sequencing reads will be obtained
+  from contaminating normal cells without the deletion. Therefore, extremely low
+  log2 copy ratio values indicate not homozygous deletions but failed
+  sequencing or mapping in all cells, regardless of copy number status at that
+  site; such bins are not informative for copy number. This option in the
+  :ref:`batch` command applies to segmentation; the option is also available in
+  the :ref:`segment`, :ref:`metrics`, :ref:`segmetrics`, :ref:`gainloss` and
+  :doc:`heterogeneity` commands.
+
+    - Why -15? The null log2 value substituted for bins with zero coverage is
+      -20 (about 1 millionth the average bin's coverage), and the maximum
+      positive shift that can be introduced by normalizing to the reference is 5
+      (for bins with 1/32 the average coverage; bins below this are masked out
+      by the reference). In a .cnr file, any bins with log2 value below -15 are
+      probably based on dummy values corresponding to zero-coverage (perhaps
+      unmappable) bins, and not real observations.
+
+- The :ref:`batch` command does not directly output integer copy number calls
+  (see :doc:`heterogeneity`). Instead, use the ``--ploidy`` and ``--purity``
+  options in :ref:`call` to calculate copy number for each sample individually
+  using known or estimated tumor-cell fractions. Also consider using
+  ``--center median`` in highly aneuploid samples to shift the log2 value of
+  true neutral regions closer to zero, as it may be slightly off initially.
+
+- If SNV calls are available in VCF format, use the ``-v``/``--vcf`` option in
+  the :ref:`call` and :ref:`scatter` commands to calculate or plot b-allele
+  frequencies alongside each segment's total copy number or log2 ratio. These
+  values reveal allelic imbalance and loss of heterozygosity (LOH), supporting
+  and extending the inferred CNVs.