Cleaner flags for setting query regime (#450)
* Query modes (labels, num. matches, counts, coords, ...)

* Removed the support for quantile computation -- can always be done in postprocessing

* Renamed flags `--discovery-fraction` -> `--min-kmers-fraction-label`, `--presence-fraction` -> `--min-kmers-fraction-graph`
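
For existing scripts that still use the old flag names, the renames listed above (together with the `--query-counts` / `--query-coords` replacements visible in the documentation changes of this commit) could be applied mechanically. The following is only a sketch: the function name `migrate_query_flags` is hypothetical, it is a plain text substitution rather than anything shipped with MetaGraph, and it does not touch flags whose new equivalent is not stated in this commit (for example `--count-kmers`).

```bash
# Hypothetical helper (not part of MetaGraph): rewrite old query flags
# to the names introduced in this commit.
# Reads text on stdin and writes the migrated text to stdout.
migrate_query_flags() {
    sed \
        -e 's/--discovery-fraction/--min-kmers-fraction-label/g' \
        -e 's/--presence-fraction/--min-kmers-fraction-graph/g' \
        -e 's/--query-counts/--query-mode counts/g' \
        -e 's/--query-coords/--query-mode coords/g'
}
```

A possible use is `migrate_query_flags < old_script.sh > new_script.sh`, where both file names are placeholders; reviewing the result by hand is advisable, since a plain substitution cannot tell a real flag from the same text inside a comment.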
karasikov authored May 30, 2023
1 parent 61dc811 commit 666793b
Showing 36 changed files with 275 additions and 256 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -152,7 +152,7 @@ Requires `M*V/8 + Size(BRWT)` bytes of RAM, where `M` is the number of rows in t
```bash
./metagraph query -v -i <GRAPH_DIR>/graph.dbg \
-a <GRAPH_DIR>/annotation.column.annodbg \
- --discovery-fraction 0.8 --labels-delimiter ", " \
+ --min-kmers-fraction-label 0.8 --labels-delimiter ", " \
query_seq.fa
```

2 changes: 1 addition & 1 deletion metagraph/docs/source/conf.py
@@ -60,7 +60,7 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
- language = None
+ language = 'en'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
28 changes: 21 additions & 7 deletions metagraph/docs/source/quick_start.rst
@@ -400,9 +400,9 @@ For more details, see section :ref:`transform_count_annotations`.

Once the annotation is transformed, k-mer abundances can be queried with::

- metagraph query --query-counts ...
+ metagraph query --query-mode counts ...

- Note that if flag ``--query-counts`` is not passed, the index will be queried in the default k-mer presence/absence regime.
+ Note that if flag ``--query-mode counts`` is not passed, the index will be queried in the default k-mer presence/absence regime.


.. _indexing coordinates:
@@ -438,10 +438,10 @@ Query k-mer coordinates
Once a coordinate-aware annotation is constructed, it can be transformed into a more memory-efficient representation supporting
querying (see :ref:`transform_coord_annotations` below) and then queried with::

- metagraph query --query-coords ...
+ metagraph query --query-mode coords ...

- As the coordinate-aware annotations also contain the information about k-mer abundance, they can be queried to retrieve k-mer counts (simply pass ``--query-counts`` instead of ``--query-coords``).
- Note that if neither ``--query-coords`` nor ``--query-counts`` is passed, the index will be queried in the default k-mer presence/absence regime.
+ As the coordinate-aware annotations also contain the information about k-mer abundance, they can be queried to retrieve k-mer counts (simply pass ``--query-mode counts`` instead of ``--query-mode coords``).
+ Note that if neither ``--query-mode coords`` nor ``--query-mode counts`` is passed, the index will be queried in the default k-mer presence/absence regime.

.. _transform annotation:

@@ -457,6 +457,7 @@ compression performance and the complexity of the construction algorithm.
In contrast, ``RowDiff<Multi-BRWT>`` typically achieves
the best compression while still providing a good query performance, and thus, it is
recommended for very large problem instances.
+ Finally, ``RowDiff<RowSparse>`` provides a good trade-off between the query speed and compression performance.

Convert annotation to Rainbowfish
"""""""""""""""""""""""""""""""""
@@ -537,6 +538,7 @@ The conversion to ``RowDiff<Multi-BRWT>`` is done in two steps.
2. Transform the diff-transformed columns ``*.row_diff.annodbg`` to ``Multi-BRWT``::

find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \
+ -i graph.dbg \
--anno-type row_diff_brwt \
--greedy ...
metagraph relax_brwt -v -p 18 \
@@ -546,6 +548,19 @@ The conversion to ``RowDiff<Multi-BRWT>`` is done in two steps.

Also see the above paragraph :ref:`to_multi_brwt` for other options.


+ .. _to_row_diff_sparse:
+
+ Convert annotation to RowDiff<RowSparse>
+ """""""""""""""""""""""""""""""""""""""""
+ The conversion to ``RowDiff<RowSparse>`` is similar to :ref:`to_row_diff_brwt`. The first step is the same.
+ In the second step, the diff-transformed columns ``*.row_diff.annodbg`` are converted to ``RowSparse``::
+
+ find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \
+ -i graph.dbg \
+ --anno-type row_diff_sparse
+

.. _transform_count_annotations:

Convert count-aware annotations
@@ -589,8 +604,7 @@ To query a MetaGraph index (graph + annotation) using the command line interface

metagraph query -i graph.dbg \
-a annotation.column.annodbg \
- --count-kmers \
- --discovery-fraction 0.1 \
+ --min-kmers-fraction-label 0.1 \
transcripts_1000.fa

For alignment, see ``metagraph align``.
2 changes: 1 addition & 1 deletion metagraph/docs/source/sequence_search.rst
@@ -33,7 +33,7 @@ column being the corresponding node in the graph (or :code:`0` if not present).
For less verbose output, the additional :code:`--query-presence` and :code:`--count-kmers`
flags are available.

- - :code:`--query-presence` outputs one line per sequence indicating whether the sequence is present (:code:`1`) or absent (:code:`0`). A sequence is considered to be present if its fraction of present k-mers is at least :code:`d`, as set by the :code:`--discovery-fraction` flag.
+ - :code:`--query-presence` outputs one line per sequence indicating whether the sequence is present (:code:`1`) or absent (:code:`0`). A sequence is considered to be present if its fraction of present k-mers is at least :code:`d`, as set by the :code:`--min-kmers-fraction-label` flag.
- :code:`--count-kmers` outputs one line per sequence in TSV format. The first column is the sequence header, while the second column is of the form :code:`a/b/c`, where :code:`a` is the number of matching k-mers, :code:`b` is the total number of k-mers, and :code:`c` is the total number of unique matching k-mers (where reverse complements are considered to be matching).
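
The two output formats described above can be related in postprocessing. The sketch below assumes a TSV input exactly as described for :code:`--count-kmers` (header in the first column, :code:`a/b/c` in the second) and treats :code:`a/b` as the fraction of present k-mers; whether this is precisely the quantity MetaGraph compares against :code:`d` is an assumption, and the function name ``kmer_fraction`` is hypothetical.

```bash
# Hypothetical postprocessing helper (not part of MetaGraph).
# Input (stdin): TSV lines "<header>\t<a>/<b>/<c>".
# Output: "<header>\t<a/b>\t<1 or 0>", where the last column reports
# whether the fraction a/b reaches the threshold given as $1.
kmer_fraction() {
    awk -F'\t' -v d="$1" '{
        split($2, n, "/")                        # n[1]=a, n[2]=b, n[3]=c
        frac = (n[2] > 0) ? n[1] / n[2] : 0      # guard against b = 0
        printf "%s\t%.3f\t%d\n", $1, frac, (frac >= d)
    }'
}
```

For example, ``kmer_fraction 0.8 < counts.tsv`` (with ``counts.tsv`` a placeholder file name) would flag the sequences whose matching fraction is at least 0.8.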

Sequence-to-graph alignment