Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Prototype: count k-mers in banded mode in a single pass #1818

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

standage
Copy link
Member

This PR is a Python prototype of the idea described in #1800. In brief, the idea is to count k-mers in the input reads and populate N counttables (each with ≈1/N of the k-mers) in a single pass.

This prototype implements a k-mer buffer that, when "full", writes to disk and flushes/resets. Once all the input reads are processed, the temp files are read back from disk 1-by-1 to populate a Counttable for each band.

The prototype works as expected, but the performance is painfully slow on large data sets. It would need a Cython or C++ implementation with threading support to be viable methinks. I would welcome any thoughts. cc @camillescott @betatim @ctb @luizirber

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only backwards-compatible
    additions are allowed without a major version increment. Changing file
    formats also requires a major version number increment.
  • For substantial changes or changes to the command-line interface, is it
    documented in CHANGELOG.md? See keepachangelog
    for more details.
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)

@standage
Copy link
Member Author

standage commented Nov 16, 2017

curl -L https://osf.io/f5trh/download?version=1 -o reads.fq.gz
sandbox/count-band-single-pass.py \
    --ksize 25 --num-bands 4 --buffersize 1e7 --memory 1e7 \
    --outfmt "bandtest{}.ct" \
    reads.fq.gz

The procedure above will result in the same output as the Python commands below, which do 4 passes over the reads.

for i in range(4):
    counts = khmer.Counttable(25, 1e7 / 4, 4)
    counts.consume_seqfile_banding('reads.fq.gz', 4, i)
    of = 'bandtest{}.ct'.format(i)
    counts.save(of)

@betatim
Copy link
Member

betatim commented Nov 28, 2017

For my education: why split into bands but then create N counttables? I understand "let's make N bands because we have far too much data anyway" but that doesn't fit with then making N bands. Tim is confused.com

@standage
Copy link
Member Author

We are investigating some use cases where we compute only on a single band, but there is at least one use case (motivated by kevlar and variant discovery) where we want to compute on each band independently. In this case, the banding provides us a way of randomly sharding the k-mers into N disjoint sets, each of which only requires 1/N of the memory required to store the entire data set.

This PR is looking for an efficient way to compute counts for all bands in a single first pass, rather than N passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants