Cleaner flags for setting query regime (#450)

* Query modes (labels, num. matches, counts, coords, ...)
* Removed the support for quantile computation -- can always be done in postprocessing
* Renamed flags `--discovery-fraction` -> `--min-kmers-fraction-label`, `--presence-fraction` -> `--min-kmers-fraction-graph`
ratschlab · May 30, 2023 · 666793b
1 parent 61dc811
commit 666793b
Showing 36 changed files with 275 additions and 256 deletions.
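
Scripts that still pass the old flags could be updated mechanically. The following is a minimal Python sketch and is not part of MetaGraph: `RENAMED_FLAGS` and `migrate_args` are hypothetical names, and the mapping covers only the renames visible in this commit message and its documentation diff. It does not handle `--flag=value` forms or options that were removed outright.

```python
# Hypothetical helper (not part of MetaGraph): rewrite pre-#450 query flags
# to their new spellings, based only on the renames shown in this commit.
RENAMED_FLAGS = {
    "--discovery-fraction": ["--min-kmers-fraction-label"],
    "--presence-fraction": ["--min-kmers-fraction-graph"],
    "--query-counts": ["--query-mode", "counts"],
    "--query-coords": ["--query-mode", "coords"],
}


def migrate_args(args):
    """Return a copy of `args` with old flag names replaced by new ones.

    Arguments that are not in RENAMED_FLAGS (values, file names, other
    flags) are passed through unchanged and in the same order.
    """
    migrated = []
    for arg in args:
        if arg in RENAMED_FLAGS:
            migrated.extend(RENAMED_FLAGS[arg])
        else:
            migrated.append(arg)
    return migrated


if __name__ == "__main__":
    old = ["query", "--discovery-fraction", "0.8", "--query-counts", "seq.fa"]
    print(" ".join(migrate_args(old)))
```

Under these assumptions, the old invocation `query --discovery-fraction 0.8 --query-counts seq.fa` becomes `query --min-kmers-fraction-label 0.8 --query-mode counts seq.fa`.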
diff --git a/README.md b/README.md
@@ -152,7 +152,7 @@ Requires `M*V/8 + Size(BRWT)` bytes of RAM, where `M` is the number of rows in t
 ```bash
 ./metagraph query -v -i <GRAPH_DIR>/graph.dbg \
                         -a <GRAPH_DIR>/annotation.column.annodbg \
-                        --discovery-fraction 0.8 --labels-delimiter ", " \
+                        --min-kmers-fraction-label 0.8 --labels-delimiter ", " \
                         query_seq.fa
 ```
 

diff --git a/metagraph/docs/source/conf.py b/metagraph/docs/source/conf.py
@@ -60,7 +60,7 @@
 #
 # This is also used if you do content translation via gettext catalogs.
 # Usually you set "language" from the command line for these cases.
-language = None
+language = 'en'
 
 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.

diff --git a/metagraph/docs/source/quick_start.rst b/metagraph/docs/source/quick_start.rst
@@ -400,9 +400,9 @@ For more details, see section :ref:`transform_count_annotations`.
 
 Once the annotation is transformed, k-mer abundances can be queried with::
 
-    metagraph query --query-counts ...
+    metagraph query --query-mode counts ...
 
-Note that if flag ``--query-counts`` is not passed, the index will be queried in the default k-mer presence/absence regime.
+Note that if flag ``--query-mode counts`` is not passed, the index will be queried in the default k-mer presence/absence regime.
 
 
 .. _indexing coordinates:
@@ -438,10 +438,10 @@ Query k-mer coordinates
 Once a coordinate-aware annotation is constructed, it can be transformed into a more memory-efficient representation supporting
 querying (see :ref:`transform_coord_annotations` below) and then queried with::
 
-    metagraph query --query-coords ...
+    metagraph query --query-mode coords ...
 
-As the coordinate-aware annotations also contain the information about k-mer abundance, they can be queried to retrieve k-mer counts (simply pass ``--query-counts`` instead of ``--query-coords``).
-Note that if neither ``--query-coords`` nor ``--query-counts`` is passed, the index will be queried in the default k-mer presence/absence regime.
+As the coordinate-aware annotations also contain the information about k-mer abundance, they can be queried to retrieve k-mer counts (simply pass ``--query-mode counts`` instead of ``--query-mode coords``).
+Note that if neither ``--query-mode coords`` nor ``--query-mode counts`` is passed, the index will be queried in the default k-mer presence/absence regime.
 
 .. _transform annotation:
 
@@ -457,6 +457,7 @@ compression performance and the complexity of the construction algorithm.
 In contrast, ``RowDiff<Multi-BRWT>`` typically achieves
 the best compression while still providing a good query performance, and thus, it is
 recommended for very large problem instances.
+Finally, ``RowDiff<RowSparse>`` provides a good trade-off between the query speed and compression performance.
 
 Convert annotation to Rainbowfish
 """""""""""""""""""""""""""""""""
@@ -537,6 +538,7 @@ The conversion to ``RowDiff<Multi-BRWT>`` is done in two steps.
 2.  Transform the diff-transformed columns ``*.row_diff.annodbg`` to ``Multi-BRWT``::
 
         find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \
+                                                        -i graph.dbg \
                                                         --anno-type row_diff_brwt \
                                                         --greedy ...
         metagraph relax_brwt -v -p 18 \
@@ -546,6 +548,19 @@ The conversion to ``RowDiff<Multi-BRWT>`` is done in two steps.
 
     Also see the above paragraph :ref:`to_multi_brwt` for other options.
 
+
+.. _to_row_diff_sparse:
+
+Convert annotation to RowDiff<RowSparse>
+"""""""""""""""""""""""""""""""""""""""""
+The conversion to ``RowDiff<RowSparse>`` is similar to :ref:`to_row_diff_brwt`. The first step is the same.
+In the second step, the diff-transformed columns ``*.row_diff.annodbg`` are converted to ``RowSparse``::
+
+        find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \
+                                                        -i graph.dbg \
+                                                        --anno-type row_diff_sparse
+
+
 .. _transform_count_annotations:
 
 Convert count-aware annotations
@@ -589,8 +604,7 @@ To query a MetaGraph index (graph + annotation) using the command line interface
 
     metagraph query -i graph.dbg \
                     -a annotation.column.annodbg \
-                    --count-kmers \
-                    --discovery-fraction 0.1 \
+                    --min-kmers-fraction-label 0.1 \
                     transcripts_1000.fa
 
 For alignment, see ``metagraph align``.

diff --git a/metagraph/docs/source/sequence_search.rst b/metagraph/docs/source/sequence_search.rst
@@ -33,7 +33,7 @@ column being the corresponding node in the graph (or :code:`0` if not present).
 For less verbose output, the additional :code:`--query-presence` and :code:`--count-kmers`
 flags are available.
 
-- :code:`--query-presence` outputs one line per sequence indicating whether the sequence is present (:code:`1`) or absent (:code:`0`). A sequence is considered to be present if its fraction of present k-mers is at least :code:`d`, as set by the :code:`--discovery-fraction` flag.
+- :code:`--query-presence` outputs one line per sequence indicating whether the sequence is present (:code:`1`) or absent (:code:`0`). A sequence is considered to be present if its fraction of present k-mers is at least :code:`d`, as set by the :code:`--min-kmers-fraction-label` flag.
 - :code:`--count-kmers` outputs one line per sequence in TSV format. The first column is the sequence header, while the second column is of the form :code:`a/b/c`, where :code:`a` is the number of matching k-mers, :code:`b` is the total number of k-mers, and :code:`c` is the total number of unique matching k-mers (where reverse complements are considered to be matching).
 
 Sequence-to-graph alignment