Skip to content

Commit

Permalink
Flipflop
Browse files Browse the repository at this point in the history
  • Loading branch information
cjw85 committed Dec 21, 2018
1 parent 91b8ff7 commit 89f3984
Show file tree
Hide file tree
Showing 10 changed files with 88 additions and 75 deletions.
7 changes: 6 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -116,12 +116,17 @@ within the medaka environment, else they will need to be provided by the user.
BASECALLS=basecalls.fa
DRAFT=draft_assm/assm_final.fa
OUTDIR=medaka_consensus
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC}
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r94

The variables `BASECALLS`, `DRAFT`, and `OUTDIR` in the above should be set
appropriately. When `medaka_consensus` has finished running, the consensus
will be saved to `${OUTDIR}/consensus.fasta`.

**It is crucially important to specify the correct model, `-m` in the
above, according to the basecaller used. Allowed values can be found by
running `medaka consensus --help`. For example to run medaka with a
model suitable for the flip-flop basecaller in Guppy use `-m r94_flip`.**

Acknowledgements
----------------

Expand Down
102 changes: 36 additions & 66 deletions docs/benchmarks.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,85 +6,55 @@ improved consensus from a pileup of reads.

Results were obtained using the default model provided with `medaka`. This model
was trained using data obtained from E.coli, S.cerevisaie and H.sapiens samples.
Training data were basecalled using Guppy 0.3.0. Draft assemblies were created
using the `mini_assemble
<https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_.

Error statistics were calculated using the `pomoxis
<https://github.com/nanoporetech/pomoxis>`_ program `stats_from_bam` after
<https://github.com/nanoporetech/pomoxis>`_ program `assess_assembly` after
aligning 100kb chunks of the consensus to the reference. Reported metrics are
median values over all chunks.


Comparison of `medaka` and `nanopolish`
---------------------------------------

Evaluation of the model was performed using the `medaka` E.coli
:doc:`walkthrough` dataset. These data we not used to train the model.
Basecalling was performed with `scrappie v1.3.2
<https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94` model. The
pileup had a median depth of ~80-fold. `nanopolish v0.10.1
<https://github.com/jts/nanopolish>`_ was run with homopolymer correction but
without methylation correction. `medaka` and `nanopolish` were run on the same
hardware.

+-----------------+--------+------------+
| | medaka | nanopolish |
+=================+========+============+
| Q(Accuracy) | 31.55 | 31.22 |
+-----------------+--------+------------+
| Q(Identity) | 46.02 | 45.23 |
+-----------------+--------+------------+
| Q(Deletion) | 32.60 | 31.81 |
+-----------------+--------+------------+
| Q(Insertion) | 39.21 | 43.01 |
+-----------------+--------+------------+
| runtime (hours) | 0.27 | 1.05 |
+-----------------+--------+------------+
| CPU cores | 4 | 48 |
+-----------------+--------+------------+
| CPU hours | 1.06 | 50.4 |
+-----------------+--------+------------+

For this dataset `medaka` delivers similar results to `nanopolish` in a fraction
of the time.
In this comparison the `medaka` E.coli :doc:`walkthrough` dataset was used.
These data were not used to train the model. Basecalling was performed using
`Guppy v2.2.1`; both the older transducer and the newer flip-flop algorithm
were used for comparison. Basecalled reads were trimmed using `porechop
<https://github.com/rrwick/Porechop>`_ to remove adapters, and assembly was
performed using `canu v1.8 <https://github.com/marbl/canu>`_. The assembly was
corrected using `racon v1.3.1 <https://github.com/isovic/racon>`_ before being passed
to `medaka` or `nanopolish`. `nanopolish v0.10.1
<https://github.com/jts/nanopolish>`_ was run using :code:`--fix-homopolymers` option.

+-----------------+----------------------------------------+----------------------------------------+
| | **flipflop** | **transducer** |
+ +--------------+----------+--------------+--------------+----------+--------------+
| | *racon (x4)* | *medaka* | *nanopolish* | *racon (x4)* | *medaka* | *nanopolish* |
+-----------------+--------------+----------+--------------+--------------+----------+--------------+
| Q(accuracy) | 26.8 | 34.2 | 32.0 | 25.2 | 31.9 | 31.0 |
+-----------------+--------------+----------+--------------+--------------+----------+--------------+
| Q(substitution) | 45.2 | 50.0 | 47.0 | 41.0 | 47.0 | 41.4 |
+-----------------+--------------+----------+--------------+--------------+----------+--------------+
| Q(deletion) | 27.1 | 34.0 | 32.6 | 25.6 | 35.0 | 30.7 |
+-----------------+--------------+----------+--------------+--------------+----------+--------------+
| Q(insertion) | 40.1 | 50.0 | 43.0 | 39.2 | 35.2 | 40.5 |
+-----------------+--------------+----------+--------------+--------------+----------+--------------+
| CPU time / hrs | 00:50 | 00:07 | 49:10 | 00:50 | 00:07 | 50:24 |
+-----------------+--------------+----------+--------------+--------------+----------+--------------+

For this dataset the older transducer basecaller with `medaka` delivers similar
results to `nanopolish` in a fraction of the time. The flip-flop workflow is
seen to be superior to nanopolish. The runtime of `medaka` can be reduced
further by utilizing a GPU, the runtime with a NVIDIA GTX1080Ti is found
to be less than one minute!


Evaluation across samples and depths
------------------------------------

Evaluation of the model was performed using E.coli, H.sapiens chromosome 21, and
`K.pneumoniae <https://github.com/rrwick/Basecalling-comparison>`_. The E.coli
and human reads were basecalled with `Guppy` version 0.3.0, while the Klebsiella
reads were basecalled with `scrappie v1.3.2
<https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94` model. The
draft assemblies here were created at multiple depths using the `mini_assemble
<https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_.

+---------------------------+-----------------+------------------+----------------------+
| Data set | Racon Error (%) | Medaka Error (%) | Nanopolish Error (%) |
+===========================+=================+==================+======================+
| E.coli 25X | 0.291 | 0.125 | 0.127 |
+---------------------------+-----------------+------------------+----------------------+
| E.coli 50X | 0.247 | 0.079 | 0.080 |
+---------------------------+-----------------+------------------+----------------------+
| E.coli 75X | 0.238 | 0.065 | 0.069 |
+---------------------------+-----------------+------------------+----------------------+
| E.coli 100X | 0.228 | 0.063 | 0.060 |
+---------------------------+-----------------+------------------+----------------------+
| E.coli 150X | 0.231 | 0.065 | 0.057 |
+---------------------------+-----------------+------------------+----------------------+
| E.coli 200X | 0.231 | 0.061 | 0.052 |
+---------------------------+-----------------+------------------+----------------------+
| L.brevis 250X | 0.293 | 0.055 | 0.047 |
+---------------------------+-----------------+------------------+----------------------+
| K.pneumoniae [1]_ 200X | 0.576 | 0.318 | 0.086 |
+---------------------------+-----------------+------------------+----------------------+
The comparison below illustrates results at various coverage depths for two further
organisms. Assemblies were performed as above with canu and racon, using the flip-flop
algorithm for basecalling. (nanopolish results to come...).

.. [1] native (non-PCRed) data. Nanopolish was run with the --methylation-aware=dcm,dam
option.
.. image:: images/cov_acc.png

`medaka` produces similar results to Nanopolish (on PCRd data) in a fraction of
the time, and provides a marked benefit over Racon on native DNA.
Binary file added docs/images/cov_acc.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
11 changes: 9 additions & 2 deletions docs/installation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@ not be provided by the user.

All installation methods will allow medaka to be used with CPU resource only.
To enable the use of GPU resource it is necessary to install the
`tensorflow-gpu` package. In outline this can be achieve with:
`tensorflow-gpu` package. In outline this can be achieved with:

.. code-block:: bash
Expand Down Expand Up @@ -96,8 +96,15 @@ within the medaka environment, else they will need to be provided by the user.
BASECALLS=basecalls.fa
DRAFT=draft_assm/assm_final.fa
OUTDIR=medaka_consensus
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC}
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r94
The variables `BASECALLS`, `DRAFT`, and `OUTDIR` in the above should be set
appropriately. When `medaka_consensus` has finished running, the consensus
will be saved to `${OUTDIR}/consensus.fasta`.

.. warning::

It is crucially important to specify the correct model, :code:`-m` in the
above, according to the basecaller used. Allowed values can be found by
running :code:`medaka consensus --help`. For example to run medaka with a
model suitable for the flip-flop basecaller in Guppy use :code:`-m r94_flip`.
2 changes: 1 addition & 1 deletion medaka/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = '0.4.2'
__version__ = '0.4.3'
Empty file modified medaka/data/hp_compress_model.hdf5
100755 → 100644
Empty file.
Empty file modified medaka/data/medaka_model.hdf5
100755 → 100644
Empty file.
Binary file added medaka/data/r941_flip_model.hdf5
Binary file not shown.
39 changes: 35 additions & 4 deletions medaka/medaka.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,38 @@
from medaka.common import write_yaml_data
from medaka.features import create_labelled_samples, create_samples

default_model = os.path.join(resource_filename(__package__, 'data'), 'medaka_model.hdf5')
model_store = resource_filename(__package__, 'data')

model_dict = {
'r94': 'medaka_model.hdf5',
'r941_flip': 'r941_flip_model.hdf5'
}
model_dict = {k:os.path.join(model_store, v) for k,v in model_dict.items()}
default_model = 'r94'


class ResolveModel(argparse.Action):
"""Resolve model filename or ID into filename"""
def __init__(self, option_strings, dest, default=None, required=False, help='Model file.'):
super().__init__(
option_strings, dest, nargs=1, default=default, required=required,
help='{} {{{}}}'.format(help, ', '.join(model_dict.keys()))
)

def __call__(self, parser, namespace, values, option_string=None):
val = values[0]
if not os.path.exists(val):
# try lookup
try:
val = model_dict[val]
except:
raise RuntimeError(
"Filepath for '--{}' argument does not exist and is not a known model ID ({})".format(
self.dest, val)
)
#TODO: verify the file is a model?
setattr(namespace, self.dest, val)


def _log_level():
"""Parser to set logging level and acquire software version/commit"""
Expand All @@ -36,7 +67,7 @@ def _chunking_feature_args():
parser = argparse.ArgumentParser(
formatter_class=argparse.ArgumentDefaultsHelpFormatter, add_help=False)
parser.add_argument('bam', help='Input alignments.')
parser.add_argument('--model', default=default_model, help='Model definition.')
parser.add_argument('--model', action=ResolveModel, default=default_model, help='Model definition.')
parser.add_argument('--batch_size', type=int, default=5, help='Inference batch size.')
parser.add_argument('--regions', default=None, nargs='+', help='Genomic regions to analyse.')
parser.add_argument('--chunk_len', type=int, default=10000, help='Chunk length of samples.')
Expand Down Expand Up @@ -90,7 +121,7 @@ def main():
tparser.set_defaults(func=train)
tparser.add_argument('features', nargs='+', help='Path for training data.')
tparser.add_argument('--train_name', type=str, default='keras_train', help='Name for training run.')
tparser.add_argument('--model', help='Model definition and initial weights .hdf, or .yml with kwargs to build model.')
tparser.add_argument('--model', action=ResolveModel, help='Model definition and initial weights .hdf, or .yml with kwargs to build model.')
tparser.add_argument('--max_label_len', type=int, default=1, help='Maximum label length.')
tparser.add_argument('--epochs', type=int, default=5000, help='Maximum number of trainig epochs.')
tparser.add_argument('--validation_split', type=float, default=0.2, help='Fraction of data to validate on.')
Expand All @@ -114,7 +145,7 @@ def main():
parents=[_log_level()],
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
cfparser.add_argument('features', nargs='+', help='Pregenerated features (from medaka features).')
cfparser.add_argument('--model', default=default_model, help='Model definition.')
cfparser.add_argument('--model', action=ResolveModel, default=default_model, help='Model definition.')
cfparser.add_argument('--ref_rle', default=None, help='Encoded reference file (required only for some model types.')

# Post-processing of consensus outputs
Expand Down
2 changes: 1 addition & 1 deletion scripts/medaka_consensus
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ while getopts ':hi::d:o:m:t:b:' option; do
i ) iflag=true; BASECALLS=$(readlink -f $OPTARG);;
d ) dflag=true; DRAFT=$(readlink -f $OPTARG);;
o ) OUTPUT=$OPTARG;;
m ) MODEL=$(readlink -f $OPTARG);;
m ) MODEL=$OPTARG;;
t ) THREADS=$OPTARG;;
b ) BATCH_SIZE=$OPTARG;;
\? ) echo "Invalid option: -${OPTARG}." >&2; exit 1;;
Expand Down

0 comments on commit 89f3984

Please sign in to comment.