[WIP] Prototype: count k-mers in banded mode in a single pass #1818
This PR is a Python prototype of the idea described in #1800. In brief, the idea is to count k-mers in the input reads and populate N counttables (each with ≈1/N of the k-mers) in a single pass.
This prototype implements a k-mer buffer that, when "full", writes its contents to disk and resets. Once all the input reads have been processed, the temp files are read back from disk one by one to populate a Counttable for each band.
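The buffer-and-flush approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `collections.Counter` stands in for a khmer Counttable, the band assignment via `zlib.crc32` modulo the number of bands is an assumption made for the sketch, and the names `count_banded`, `band_of`, and `buffer_size` are hypothetical.

```python
import os
import tempfile
import zlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def band_of(kmer, num_bands):
    """Assign a k-mer to a band with a stable hash (assumption of this sketch)."""
    return zlib.crc32(kmer.encode("ascii")) % num_bands


def count_banded(reads, k, num_bands, buffer_size=100000):
    """Count k-mers into num_bands tables while iterating over reads once."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, "band%d.txt" % b)
                 for b in range(num_bands)]
        buffers = [[] for _ in range(num_bands)]
        buffered = 0

        def flush():
            # Append each band's buffered k-mers to its temp file, then reset.
            for path, buf in zip(paths, buffers):
                with open(path, "a") as fh:
                    fh.writelines(kmer + "\n" for kmer in buf)
                buf.clear()

        # Single pass over the input: buffer k-mers by band, flush when full.
        for read in reads:
            for kmer in kmers(read, k):
                buffers[band_of(kmer, num_bands)].append(kmer)
                buffered += 1
                if buffered >= buffer_size:
                    flush()
                    buffered = 0
        flush()  # write whatever is left (also creates empty band files)

        # Read the temp files back one at a time, one table per band.
        tables = []
        for path in paths:
            table = Counter()  # stand-in for a Counttable
            with open(path) as fh:
                for line in fh:
                    table[line.rstrip("\n")] += 1
            tables.append(table)
        return tables
```

Calling `count_banded(["ACGTACGT", "CGTA"], k=4, num_bands=3, buffer_size=2)` returns three tables whose combined counts match a plain single-table count. Only one band's table is being filled at any moment during the read-back step, although this sketch keeps all of them in the returned list.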
The prototype works as expected, but it is painfully slow on large data sets. I think it would need a Cython or C++ implementation with threading support to be viable. I would welcome any thoughts. cc @camillescott @betatim @ctb @luizirber
- make test: Did it pass the tests?
- make clean diff-cover: If it introduces new functionality in scripts/, is it tested?
- make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
- ... additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- ... documented in CHANGELOG.md? See keepachangelog for more details.
- ... changes were made?
- ... tested for streaming IO?)