Flipflop

nanoporetech · Dec 21, 2018 · 89f3984 · 89f3984
1 parent 91b8ff7
commit 89f3984
Show file tree

Hide file tree

Showing 10 changed files with 88 additions and 75 deletions.
diff --git a/README.md b/README.md
@@ -116,12 +116,17 @@ within the medaka environment, else they will need to be provided by the user.
     BASECALLS=basecalls.fa
     DRAFT=draft_assm/assm_final.fa
     OUTDIR=medaka_consensus
-    medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC}
+    medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r94
 
 The variables `BASECALLS`, `DRAFT`, and `OUTDIR` in the above should be set
 appropriately. When `medaka_consensus` has finished running, the consensus
 will be saved to `${OUTDIR}/consensus.fasta`.
 
+   **It is crucially important to specify the correct model, `-m` in the
+   above, according to the basecaller used. Allowed values can be found by
+   running `medaka consensus --help`.  For example to run medaka with a
+   model suitable for the flip-flop basecaller in Guppy use `-m r94_flip`.**
+
 Acknowledgements
 ----------------
 

diff --git a/docs/benchmarks.rst b/docs/benchmarks.rst
@@ -6,85 +6,55 @@ improved consensus from a pileup of reads.
 
 Results were obtained using the default model provided with `medaka`. This model
 was trained using data obtained from E.coli, S.cerevisaie and H.sapiens samples.
-Training data were basecalled using Guppy 0.3.0. Draft assemblies were created
-using the `mini_assemble
-<https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
-pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_. 
 
 Error statistics were calculated using the `pomoxis
-<https://github.com/nanoporetech/pomoxis>`_ program `stats_from_bam` after
+<https://github.com/nanoporetech/pomoxis>`_ program `assess_assembly` after
 aligning 100kb chunks of the consensus to the reference. Reported metrics are
 median values over all chunks. 
 
 
 Comparison of `medaka` and `nanopolish` 
 ---------------------------------------
 
-Evaluation of the model was performed using the `medaka` E.coli
-:doc:`walkthrough` dataset. These data we not used to train the model.
-Basecalling was performed with `scrappie v1.3.2
-<https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94` model. The
-pileup had a median depth of ~80-fold. `nanopolish v0.10.1
-<https://github.com/jts/nanopolish>`_ was run with homopolymer correction but
-without methylation correction. `medaka` and `nanopolish` were run on the same
-hardware.  
-
-+-----------------+--------+------------+
-|                 | medaka | nanopolish |
-+=================+========+============+
-| Q(Accuracy)     |  31.55 |  31.22     |
-+-----------------+--------+------------+
-| Q(Identity)     |  46.02 |  45.23     |
-+-----------------+--------+------------+
-| Q(Deletion)     |  32.60 |  31.81     |
-+-----------------+--------+------------+
-| Q(Insertion)    |  39.21 |  43.01     |
-+-----------------+--------+------------+
-| runtime (hours) |   0.27 |  1.05      |
-+-----------------+--------+------------+
-| CPU cores       |   4    |  48        |
-+-----------------+--------+------------+
-| CPU hours       |   1.06 |  50.4      |
-+-----------------+--------+------------+
-
-For this dataset `medaka` delivers similar results to `nanopolish` in a fraction
-of the time. 
+In this comparison the `medaka` E.coli :doc:`walkthrough` dataset was used.
+These data were not used to train the model.  Basecalling was performed using
+`Guppy v2.2.1`; both the older transducer and the newer flip-flop algorithm
+were used for comparison. Basecalled reads were trimmed using `porechop
+<https://github.com/rrwick/Porechop>`_ to remove adapters, and assembly was
+performed using `canu v1.8 <https://github.com/marbl/canu>`_. The assembly was
+corrected using `racon v1.3.1 <https://github.com/isovic/racon>`_ before being passed
+to `medaka` or `nanopolish`.  `nanopolish v0.10.1
+<https://github.com/jts/nanopolish>`_ was run using :code:`--fix-homopolymers` option.
+
++-----------------+----------------------------------------+----------------------------------------+
+|                 | **flipflop**                           | **transducer**                         |
++                 +--------------+----------+--------------+--------------+----------+--------------+
+|                 | *racon (x4)* | *medaka* | *nanopolish* | *racon (x4)* | *medaka* | *nanopolish* |
++-----------------+--------------+----------+--------------+--------------+----------+--------------+
+| Q(accuracy)     |         26.8 |   34.2   |         32.0 |         25.2 |     31.9 |         31.0 |
++-----------------+--------------+----------+--------------+--------------+----------+--------------+
+| Q(substitution) |         45.2 |   50.0   |         47.0 |         41.0 |     47.0 |         41.4 |
++-----------------+--------------+----------+--------------+--------------+----------+--------------+
+| Q(deletion)     |         27.1 |   34.0   |         32.6 |         25.6 |     35.0 |         30.7 |
++-----------------+--------------+----------+--------------+--------------+----------+--------------+
+| Q(insertion)    |         40.1 |   50.0   |         43.0 |         39.2 |     35.2 |         40.5 |
++-----------------+--------------+----------+--------------+--------------+----------+--------------+
+| CPU time / hrs  |        00:50 |  00:07   |        49:10 |        00:50 |    00:07 |        50:24 |
++-----------------+--------------+----------+--------------+--------------+----------+--------------+
+
+For this dataset the older transducer basecaller with `medaka` delivers similar
+results to `nanopolish` in a fraction of the time. The flip-flop workflow is
+seen to be superior to nanopolish. The runtime of `medaka` can be reduced
+further by utilizing a GPU, the runtime with a NVIDIA GTX1080Ti is found
+to be less than one minute!
 
 
 Evaluation across samples and depths
 ------------------------------------
 
-Evaluation of the model was performed using E.coli, H.sapiens chromosome 21, and
-`K.pneumoniae <https://github.com/rrwick/Basecalling-comparison>`_.  The E.coli
-and human reads were basecalled with `Guppy` version 0.3.0, while the Klebsiella
-reads were basecalled with `scrappie v1.3.2
-<https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94` model. The
-draft assemblies here were created at multiple depths using the `mini_assemble
-<https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
-pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_.
-
-+---------------------------+-----------------+------------------+----------------------+
-| Data set                  | Racon Error (%) | Medaka Error (%) | Nanopolish Error (%) |
-+===========================+=================+==================+======================+
-| E.coli 25X                |       0.291     |       0.125      |       0.127          |
-+---------------------------+-----------------+------------------+----------------------+
-| E.coli 50X                |       0.247     |       0.079      |       0.080          |
-+---------------------------+-----------------+------------------+----------------------+
-| E.coli  75X               |       0.238     |       0.065      |       0.069          |
-+---------------------------+-----------------+------------------+----------------------+
-| E.coli 100X               |       0.228     |       0.063      |       0.060          |
-+---------------------------+-----------------+------------------+----------------------+
-| E.coli 150X               |       0.231     |       0.065      |       0.057          |
-+---------------------------+-----------------+------------------+----------------------+
-| E.coli 200X               |       0.231     |       0.061      |       0.052          |
-+---------------------------+-----------------+------------------+----------------------+
-| L.brevis 250X             |       0.293     |       0.055      |       0.047          |
-+---------------------------+-----------------+------------------+----------------------+
-| K.pneumoniae [1]_ 200X    |       0.576     |       0.318      |       0.086          |
-+---------------------------+-----------------+------------------+----------------------+
+The comparison below illustrates results at various coverage depths for two further
+organisms. Assemblies were performed as above with canu and racon, using the flip-flop
+algorithm for basecalling. (nanopolish results to come...).
 
-.. [1] native (non-PCRed) data. Nanopolish was run with the --methylation-aware=dcm,dam
-       option.
+.. image:: images/cov_acc.png
 
-`medaka` produces similar results to Nanopolish (on PCRd data) in a fraction of
-the time, and provides a marked benefit over Racon on native DNA.
diff --git a/docs/images/cov_acc.png b/docs/images/cov_acc.png
diff --git a/docs/installation.rst b/docs/installation.rst
@@ -64,7 +64,7 @@ not be provided by the user.
 
 All installation methods will allow medaka to be used with CPU resource only.
 To enable the use of GPU resource it is necessary to install the
-`tensorflow-gpu` package. In outline this can be achieve with:
+`tensorflow-gpu` package. In outline this can be achieved with:
 
 .. code-block:: bash
 
@@ -96,8 +96,15 @@ within the medaka environment, else they will need to be provided by the user.
     BASECALLS=basecalls.fa
     DRAFT=draft_assm/assm_final.fa
     OUTDIR=medaka_consensus
-    medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC}
+    medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r94
 
 The variables `BASECALLS`, `DRAFT`, and `OUTDIR` in the above should be set
 appropriately. When `medaka_consensus` has finished running, the consensus
 will be saved to `${OUTDIR}/consensus.fasta`.
+
+.. warning::
+
+   It is crucially important to specify the correct model, :code:`-m` in the
+   above, according to the basecaller used. Allowed values can be found by
+   running :code:`medaka consensus --help`. For example to run medaka with a
+   model suitable for the flip-flop basecaller in Guppy use :code:`-m r94_flip`.
diff --git a/medaka/__init__.py b/medaka/__init__.py
@@ -1 +1 @@
-__version__ = '0.4.2'
+__version__ = '0.4.3'
diff --git a/medaka/data/hp_compress_model.hdf5 b/medaka/data/hp_compress_model.hdf5
diff --git a/medaka/data/medaka_model.hdf5 b/medaka/data/medaka_model.hdf5
diff --git a/medaka/data/r941_flip_model.hdf5 b/medaka/data/r941_flip_model.hdf5
diff --git a/medaka/medaka.py b/medaka/medaka.py
@@ -11,7 +11,38 @@
 from medaka.common import write_yaml_data
 from medaka.features import create_labelled_samples, create_samples
 
-default_model = os.path.join(resource_filename(__package__, 'data'), 'medaka_model.hdf5')
+model_store = resource_filename(__package__, 'data')
+
+model_dict = {
+  'r94': 'medaka_model.hdf5',
+  'r941_flip': 'r941_flip_model.hdf5' 
+}
+model_dict = {k:os.path.join(model_store, v) for k,v in model_dict.items()}
+default_model = 'r94'
+
+
+class ResolveModel(argparse.Action):
+    """Resolve model filename or ID into filename"""
+    def __init__(self, option_strings, dest, default=None, required=False, help='Model file.'):
+        super().__init__(
+            option_strings, dest, nargs=1, default=default, required=required,
+            help='{} {{{}}}'.format(help, ', '.join(model_dict.keys()))
+        )
+
+    def __call__(self, parser, namespace, values, option_string=None):
+        val = values[0]
+        if not os.path.exists(val):
+            # try lookup
+            try:
+                val = model_dict[val]
+            except:
+                raise RuntimeError(
+                    "Filepath for '--{}' argument does not exist and is not a known model ID ({})".format(
+                        self.dest, val)
+                )
+            #TODO: verify the file is a model?
+        setattr(namespace, self.dest, val)
+
 
 def _log_level():
     """Parser to set logging level and acquire software version/commit"""
@@ -36,7 +67,7 @@ def _chunking_feature_args():
     parser = argparse.ArgumentParser(
         formatter_class=argparse.ArgumentDefaultsHelpFormatter, add_help=False)
     parser.add_argument('bam', help='Input alignments.')
-    parser.add_argument('--model', default=default_model, help='Model definition.')
+    parser.add_argument('--model', action=ResolveModel, default=default_model, help='Model definition.')
     parser.add_argument('--batch_size', type=int, default=5, help='Inference batch size.')
     parser.add_argument('--regions', default=None, nargs='+', help='Genomic regions to analyse.')
     parser.add_argument('--chunk_len', type=int, default=10000, help='Chunk length of samples.')
@@ -90,7 +121,7 @@ def main():
     tparser.set_defaults(func=train)
     tparser.add_argument('features', nargs='+', help='Path for training data.')
     tparser.add_argument('--train_name', type=str, default='keras_train', help='Name for training run.')
-    tparser.add_argument('--model', help='Model definition and initial weights .hdf, or .yml with kwargs to build model.')
+    tparser.add_argument('--model', action=ResolveModel, help='Model definition and initial weights .hdf, or .yml with kwargs to build model.')
     tparser.add_argument('--max_label_len', type=int, default=1, help='Maximum label length.')
     tparser.add_argument('--epochs', type=int, default=5000, help='Maximum number of trainig epochs.')
     tparser.add_argument('--validation_split', type=float, default=0.2, help='Fraction of data to validate on.')
@@ -114,7 +145,7 @@ def main():
         parents=[_log_level()],
         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
     cfparser.add_argument('features', nargs='+', help='Pregenerated features (from medaka features).')
-    cfparser.add_argument('--model', default=default_model, help='Model definition.')
+    cfparser.add_argument('--model', action=ResolveModel, default=default_model, help='Model definition.')
     cfparser.add_argument('--ref_rle', default=None, help='Encoded reference file (required only for some model types.')
 
     # Post-processing of consensus outputs

diff --git a/scripts/medaka_consensus b/scripts/medaka_consensus
@@ -27,7 +27,7 @@ while getopts ':hi::d:o:m:t:b:' option; do
     i  ) iflag=true; BASECALLS=$(readlink -f $OPTARG);;
     d  ) dflag=true; DRAFT=$(readlink -f $OPTARG);;
     o  ) OUTPUT=$OPTARG;;
-    m  ) MODEL=$(readlink -f $OPTARG);;
+    m  ) MODEL=$OPTARG;;
     t  ) THREADS=$OPTARG;;
     b  ) BATCH_SIZE=$OPTARG;;
     \? ) echo "Invalid option: -${OPTARG}." >&2; exit 1;;