Skip to content

Commit

Permalink
Merge remote-tracking branch 'origin/master' into rowdiff
Browse files Browse the repository at this point in the history
  • Loading branch information
hmusta committed Feb 7, 2025
2 parents 3a87651 + 5880e3f commit cca1c69
Show file tree
Hide file tree
Showing 15 changed files with 85 additions and 50 deletions.
9 changes: 1 addition & 8 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -76,14 +76,7 @@ jobs:
sudo apt-get install libbz2-dev libjemalloc-dev libboost-all-dev libdeflate-dev
if [ "${{ matrix.build_static }}" = "ON" ]; then
# for static builds, compile curl ourselves
git clone https://github.com/curl/curl.git
mkdir curl/_build
cd curl/_build
cmake -DBUILD_SHARED_LIBS=off ..
make -j 2
sudo make install
cd ..
sudo apt-get install libcurl4-openssl-dev
fi
echo "CC=$(which ${{ matrix.cc }})" >> $GITHUB_ENV
Expand Down
7 changes: 5 additions & 2 deletions metagraph/docs/source/quick_start.rst
Original file line number Diff line number Diff line change
Expand Up @@ -92,9 +92,12 @@ input and deduplicate/count/filter the input k-mers.
For example, the following commands can be used to construct a graph only from k-mers
occurring at least 5 times in the input::

# download the input reads
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR403/SRR403017/SRR403017.fastq.gz

K=31
# count k-mers with KMC
./KMC/kmc -ci5 -t4 -k$K -m5 -fm SRR403017.fasta.gz SRR403017.cutoff_5 ./KMC
./KMC/kmc -ci5 -t4 -k$K -m5 -fq SRR403017.fastq.gz SRR403017.cutoff_5 ./KMC
# build graph with MetaGraph
metagraph build -v -p 4 -k $K -o graph SRR403017.cutoff_5.kmc_pre

Expand All @@ -121,7 +124,7 @@ with pre-counting (see :ref:`annotate_with_precounting`).
.. note::
A weighted graph can also be constructed directly from raw input sequences, without pre-counting with KMC, e.g.,::

metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.fasta.gz
metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.fastq.gz

This should be used when pre-processing with KMC is complicated or impossible, e.g., when indexing protein sequences.

Expand Down
23 changes: 14 additions & 9 deletions metagraph/docs/source/sequence_search.rst
Original file line number Diff line number Diff line change
Expand Up @@ -38,17 +38,22 @@ flags are available.

Sequence-to-graph alignment
^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the :code:`--map` flag is not used, this enables sequence-to-graph alignment, approximately finding the best-matching path in the graph to the query sequence. Alongside this path, this mode also returns an alignment score and a CIGAR string describing the edits needed to transform the spelling of the graph path to the query sequence. An example command may be::

Additional parameters
^^^^^^^^^^^^^^^^^^^^^
metagraph align -i MYGRAPH.dbg MYREADS.fa

Query sequences against the index
---------------------------------
(Experiment discovery)
The output of the query is in TSV format, with one line per query sequence, where the columns are as follows:

Parameters for exact k-mer matching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Query name
2. Query sequence
3. Strand
4. Reference sequence (the spelling of the matched path
5. Alignment score
6. Number of exact matches
7. CIGAR string-like alignment summary
8. Number of nucleotides trimmed from the prefix of the reference sequence
9. Reference label matches (:code:`;` separated, present if the :code:`-a` flag is passed)

Parameters for alignment
^^^^^^^^^^^^^^^^^^^^^^^^
An important parameter is the seed length, which can be set with :code:`--align-min-seed-length` and can be shorter than the value of k used to construct the graph.

If an annotator is provided with the :code:`-a` flag, the returned alignments will be label-consistent, meaning that there is at least one label that is shared by all nodes on the path.
3 changes: 2 additions & 1 deletion metagraph/src/cli/clean.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,8 @@ int clean_graph(Config *config) {

uint64_t cutoff
= estimate_min_kmer_abundance(*_graph, *node_weights,
config->num_singleton_kmers);
config->num_singleton_kmers,
config->cleaning_threshold_percentile);

if (cutoff != static_cast<uint64_t>(-1)) {
config->min_unitig_median_kmer_abundance = cutoff;
Expand Down
3 changes: 3 additions & 0 deletions metagraph/src/cli/config/config.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,8 @@ Config::Config(int argc, char *argv[]) {
min_tip_size = atoi(get_value(i++));
} else if (!strcmp(argv[i], "--prune-unitigs")) {
min_unitig_median_kmer_abundance = atoi(get_value(i++));
} else if (!strcmp(argv[i], "--cleaning-threshold-percentile")) {
cleaning_threshold_percentile = std::stod(get_value(i++));
} else if (!strcmp(argv[i], "--fallback")) {
fallback_abundance_cutoff = atoi(get_value(i++));
} else if (!strcmp(argv[i], "--smoothing-window")) {
Expand Down Expand Up @@ -1000,6 +1002,7 @@ if (advanced) {
fprintf(stderr, "\t --prune-tips [INT] \t\tprune all dead ends shorter than this value [1]\n");
fprintf(stderr, "\t --prune-unitigs [INT] \tprune all unitigs with median k-mer counts smaller\n"
"\t \t\tthan this value (0: auto) [1]\n");
fprintf(stderr, "\t --cleaning-threshold-percentile [FLOAT] the percentile of the k-mer count distribution to set as the cleaning threshold [0.001]\n");
fprintf(stderr, "\t --fallback [INT] \t\tfallback threshold if the automatic one cannot be\n"
"\t \t\tdetermined (-1: disables fallback) [1]\n");
fprintf(stderr, "\n");
Expand Down
1 change: 1 addition & 0 deletions metagraph/src/cli/config/config.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,7 @@ class Config {
double alignment_min_exact_match = 0.7;
double min_fraction = 0.0;
double max_fraction = 1.0;
double cleaning_threshold_percentile = 0.001;
std::vector<double> count_slice_quantiles;
std::vector<double> count_quantiles;

Expand Down
3 changes: 1 addition & 2 deletions metagraph/src/common/algorithms.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,7 @@ namespace utils {
size_t segment_length) {
std::vector<bool> mask(array.size(), false);
size_t last_occurrence
= std::find(array.data(), array.data() + array.size(), label)
- array.data();
= std::find(array.begin(), array.end(), label) - array.begin();

for (size_t i = last_occurrence; i < array.size(); ++i) {
if (array[i] == label)
Expand Down
9 changes: 7 additions & 2 deletions metagraph/src/graph/graph_cleaning.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,13 +33,15 @@ bool is_unreliable_unitig(const std::vector<SequenceGraph::node_index> &path,


int cleaning_pick_kmer_threshold(const uint64_t *kmer_covg, size_t arrlen,
double fdr_thres,
double *alpha_est_ptr, double *beta_est_ptr,
double *false_pos_ptr, double *false_neg_ptr);

// returns -1 if the automatic estimation fails
uint64_t estimate_min_kmer_abundance(const DeBruijnGraph &graph,
const NodeWeights &node_weights,
uint64_t num_singleton_kmers) {
uint64_t num_singleton_kmers,
double cleaning_threshold_percentile) {
std::vector<uint64_t> hist;
graph.call_nodes([&](auto i) {
uint64_t kmer_count = node_weights[i];
Expand All @@ -60,6 +62,7 @@ uint64_t estimate_min_kmer_abundance(const DeBruijnGraph &graph,

double alpha_est_ptr, beta_est_ptr, false_pos_ptr, false_neg_ptr;
return cleaning_pick_kmer_threshold(hist.data(), hist.size(),
cleaning_threshold_percentile,
&alpha_est_ptr, &beta_est_ptr,
&false_pos_ptr, &false_neg_ptr);
}
Expand Down Expand Up @@ -201,11 +204,13 @@ static inline bool is_cutoff_good(const uint64_t *kmer_covg, size_t arrlen,
*
* @param kmer_covg Histogram of kmer counts at coverages 1,2,.. arrlen-1
* @param arrlen Length of array kmer_covg
* @param fdr_thes Percentile of the kmer coverage histogram to use as the error threshold
* @param alpha_est_ptr If not NULL, used to return estimate for alpha
* @param beta_est_ptr If not NULL, used to return estimate for beta
* @return -1 if no cut-off satisfies FDR, otherwise returns coverage cutoff
*/
int cleaning_pick_kmer_threshold(const uint64_t *kmer_covg, size_t arrlen,
double fdr_thres,
double *alpha_est_ptr, double *beta_est_ptr,
double *false_pos_ptr, double *false_neg_ptr)
{
Expand Down Expand Up @@ -285,7 +290,7 @@ int cleaning_pick_kmer_threshold(const uint64_t *kmer_covg, size_t arrlen,

// Find cutoff by finding first coverage level where errors make up less than
// 0.1% of total coverage
cutoff = pick_cutoff_with_fdr_thresh(e_covg, kmer_covg, arrlen, 0.001);
cutoff = pick_cutoff_with_fdr_thresh(e_covg, kmer_covg, arrlen, fdr_thres);
// printf("A cutoff: %i\n", cutoff);

// Pick highest cutoff that keeps FP < FN
Expand Down
3 changes: 2 additions & 1 deletion metagraph/src/graph/graph_cleaning.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,8 @@ bool is_unreliable_unitig(const std::vector<SequenceGraph::node_index> &path,

uint64_t estimate_min_kmer_abundance(const DeBruijnGraph &graph,
const NodeWeights &node_weights,
uint64_t num_singleton_kmers = 0);
uint64_t num_singleton_kmers = 0,
double cleaning_threshold_percentile = 0.001);

} // namespace graph
} // namespace mtg
Expand Down
6 changes: 3 additions & 3 deletions metagraph/src/graph/representation/canonical_dbg.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@ ::map_to_nodes_sequentially(std::string_view sequence,
path.reserve(sequence.size() - get_k() + 1);

if (const auto sshash = std::dynamic_pointer_cast<const DBGSSHash>(graph_)) {
sshash->map_to_nodes_with_rc<>(sequence, [&](node_index node, bool orientation) {
sshash->map_to_nodes_with_rc<true>(sequence, [&](node_index node, bool orientation) {
callback(node && orientation ? reverse_complement(node) : node);
}, terminate);
return;
Expand Down Expand Up @@ -180,7 +180,7 @@ void CanonicalDBG::call_outgoing_kmers(node_index node,
}

if (const auto sshash = std::dynamic_pointer_cast<const DBGSSHash>(graph_)) {
sshash->call_outgoing_kmers_with_rc<>(node, [&](node_index next, char c, bool orientation) {
sshash->call_outgoing_kmers_with_rc<true>(node, [&](node_index next, char c, bool orientation) {
callback(orientation ? reverse_complement(next) : next, c);
});
return;
Expand Down Expand Up @@ -273,7 +273,7 @@ void CanonicalDBG::call_incoming_kmers(node_index node,
}

if (const auto sshash = std::dynamic_pointer_cast<const DBGSSHash>(graph_)) {
sshash->call_incoming_kmers_with_rc<>(node, [&](node_index prev, char c, bool orientation) {
sshash->call_incoming_kmers_with_rc<true>(node, [&](node_index prev, char c, bool orientation) {
callback(orientation ? reverse_complement(prev) : prev, c);
});
return;
Expand Down
54 changes: 38 additions & 16 deletions metagraph/src/graph/representation/hash/dbg_sshash.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
#include "common/seq_tools/reverse_complement.hpp"
#include "common/threads/threading.hpp"
#include "common/logger.hpp"
#include "common/algorithms.hpp"
#include "kmer/kmer_extractor.hpp"


Expand Down Expand Up @@ -99,32 +100,53 @@ void DBGSSHash::add_sequence(std::string_view sequence,
throw std::logic_error("adding sequences not supported");
}

template <bool with_rc>
void DBGSSHash::map_to_nodes_with_rc(std::string_view sequence,
const std::function<void(node_index, bool)>& callback,
const std::function<bool()>& terminate) const {
if (terminate() || sequence.size() < k_)
template <bool with_rc, class Dict>
void map_to_nodes_with_rc_impl(size_t k,
const Dict &dict,
std::string_view sequence,
const std::function<void(sshash::lookup_result)>& callback,
const std::function<bool()>& terminate) {
size_t n = sequence.size();
if (terminate() || n < k)
return;

if (!num_nodes()) {
for (size_t i = 0; i < sequence.size() - k_ + 1 && !terminate(); ++i) {
callback(npos, false);
if (!dict.size()) {
for (size_t i = 0; i + k <= sequence.size() && !terminate(); ++i) {
callback(sshash::lookup_result());
}
return;
}

using kmer_t = get_kmer_t<Dict>;

std::vector<bool> invalid_char(n);
for (size_t i = 0; i < n; ++i) {
invalid_char[i] = !kmer_t::is_valid(sequence[i]);
}

auto invalid_kmer = utils::drag_and_mark_segments(invalid_char, true, k);

kmer_t uint_kmer = sshash::util::string_to_uint_kmer<kmer_t>(sequence.data(), k - 1);
uint_kmer.pad_char();
for (size_t i = k - 1; i < n && !terminate(); ++i) {
uint_kmer.drop_char();
uint_kmer.kth_char_or(k - 1, kmer_t::char_to_uint(sequence[i]));
callback(invalid_kmer[i] ? sshash::lookup_result()
: dict.lookup_advanced_uint(uint_kmer, with_rc));
}
}

template <bool with_rc>
void DBGSSHash::map_to_nodes_with_rc(std::string_view sequence,
const std::function<void(node_index, bool)>& callback,
const std::function<bool()>& terminate) const {
std::visit([&](const auto &dict) {
using kmer_t = get_kmer_t<decltype(dict)>;
kmer_t uint_kmer = sshash::util::string_to_uint_kmer<kmer_t>(sequence.data(), k_ - 1);
uint_kmer.pad_char();
for (size_t i = k_ - 1; i < sequence.size() && !terminate(); ++i) {
uint_kmer.drop_char();
uint_kmer.kth_char_or(k_ - 1, kmer_t::char_to_uint(sequence[i]));
auto res = dict.lookup_advanced_uint(uint_kmer, with_rc);
map_to_nodes_with_rc_impl<with_rc>(k_, dict, sequence, [&](sshash::lookup_result res) {
callback(sshash_to_graph_index(res.kmer_id), res.kmer_orientation);
}
}, terminate);
}, dict_);
}

template
void DBGSSHash::map_to_nodes_with_rc<true>(std::string_view,
const std::function<void(node_index, bool)>&,
Expand Down
1 change: 1 addition & 0 deletions metagraph/tests/annotation/test_aligner_labeled.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ class LabeledAlignerTest : public ::testing::Test {};

typedef ::testing::Types<std::pair<DBGHashFast, annot::ColumnCompressed<>>,
std::pair<DBGSuccinct, annot::ColumnCompressed<>>,
std::pair<DBGSSHash, annot::ColumnCompressed<>>,
std::pair<DBGHashFast, annot::RowFlatAnnotator>,
std::pair<DBGSuccinct, annot::RowFlatAnnotator>,
std::pair<DBGSuccinct, annot::RowDiffColumnAnnotator>> FewGraphAnnotationPairTypes;
Expand Down
7 changes: 3 additions & 4 deletions metagraph/tests/annotation/test_annotated_dbg.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,12 @@
#include "gtest/gtest.h"

#include "../test_helpers.hpp"
#include "../graph/all/test_dbg_helpers.hpp"

#include "common/threads/threading.hpp"
#include "common/vectors/bit_vector_dyn.hpp"
#include "common/vectors/vector_algorithm.hpp"
#include "annotation/representation/column_compressed/annotate_column_compressed.hpp"
#include "graph/representation/bitmap/dbg_bitmap.hpp"
#include "graph/representation/hash/dbg_hash_string.hpp"
#include "graph/representation/hash/dbg_hash_ordered.hpp"
#include "graph/representation/hash/dbg_hash_fast.hpp"

#define protected public
#define private public
Expand Down Expand Up @@ -987,6 +984,7 @@ typedef ::testing::Types<std::pair<DBGBitmap, annot::ColumnCompressed<>>,
std::pair<DBGHashOrdered, annot::ColumnCompressed<>>,
std::pair<DBGHashFast, annot::ColumnCompressed<>>,
std::pair<DBGSuccinct, annot::ColumnCompressed<>>,
std::pair<DBGSSHash, annot::ColumnCompressed<>>,
std::pair<DBGBitmap, annot::RowFlatAnnotator>,
std::pair<DBGHashString, annot::RowFlatAnnotator>,
std::pair<DBGHashOrdered, annot::RowFlatAnnotator>,
Expand Down Expand Up @@ -1016,6 +1014,7 @@ class AnnotatedDBGNoNTest : public ::testing::Test {};
typedef ::testing::Types<std::pair<DBGBitmap, annot::ColumnCompressed<>>,
std::pair<DBGHashOrdered, annot::ColumnCompressed<>>,
std::pair<DBGHashFast, annot::ColumnCompressed<>>,
std::pair<DBGSSHash, annot::ColumnCompressed<>>,
std::pair<DBGBitmap, annot::RowFlatAnnotator>,
std::pair<DBGHashOrdered, annot::RowFlatAnnotator>,
std::pair<DBGHashFast, annot::RowFlatAnnotator>,
Expand Down
1 change: 1 addition & 0 deletions metagraph/tests/annotation/test_annotated_dbg_helpers.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -235,6 +235,7 @@ template std::unique_ptr<AnnotatedDBG> build_anno_graph<DBGBitmap, ColumnCompres
template std::unique_ptr<AnnotatedDBG> build_anno_graph<DBGHashOrdered, ColumnCompressed<>>(uint64_t, const std::vector<std::string> &, const std::vector<std::string>&, DeBruijnGraph::Mode, bool);
template std::unique_ptr<AnnotatedDBG> build_anno_graph<DBGHashFast, ColumnCompressed<>>(uint64_t, const std::vector<std::string> &, const std::vector<std::string>&, DeBruijnGraph::Mode, bool);
template std::unique_ptr<AnnotatedDBG> build_anno_graph<DBGHashString, ColumnCompressed<>>(uint64_t, const std::vector<std::string> &, const std::vector<std::string>&, DeBruijnGraph::Mode, bool);
template std::unique_ptr<AnnotatedDBG> build_anno_graph<DBGSSHash, ColumnCompressed<>>(uint64_t, const std::vector<std::string> &, const std::vector<std::string>&, DeBruijnGraph::Mode, bool);

template std::unique_ptr<AnnotatedDBG> build_anno_graph<DBGSuccinct, RowFlatAnnotator>(uint64_t, const std::vector<std::string> &, const std::vector<std::string>&, DeBruijnGraph::Mode, bool);
template std::unique_ptr<AnnotatedDBG> build_anno_graph<DBGBitmap, RowFlatAnnotator>(uint64_t, const std::vector<std::string> &, const std::vector<std::string>&, DeBruijnGraph::Mode, bool);
Expand Down
5 changes: 3 additions & 2 deletions metagraph/tests/graph/all/test_dbg_helpers.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,7 @@ void writeFastaFile(const std::vector<std::string>& sequences, const std::string

fastaFile.close();
}

template <>
std::shared_ptr<DeBruijnGraph>
build_graph<DBGSSHash>(uint64_t k,
Expand All @@ -154,8 +155,8 @@ build_graph<DBGSSHash>(uint64_t k,
if (sequences.empty())
return std::make_shared<DBGSSHash>(k, mode);

// use DBGHashString to get contigs for SSHash
auto string_graph = build_graph<DBGHashString>(k, sequences, mode);
// use DBGHashFast to get contigs for SSHash
auto string_graph = build_graph<DBGHashFast>(k, sequences, mode);

std::vector<std::string> contigs;
size_t num_kmers = 0;
Expand Down

0 comments on commit cca1c69

Please sign in to comment.