
[MRG] compute-optimized MinHash (for small scaled or large cardinalities) #1045

Merged · 14 commits · Jun 26, 2020

Conversation

@luizirber (Member) commented Jun 24, 2020

As discussed in #1010, for small scaled values or for datasets with large cardinality, the current implementation starts spending too much time reallocating the internal vector used to keep mins and abundances. This PR is a first attempt at a compute-optimized MinHash that solves that problem, using BTree structures in Rust (a BTreeSet for mins and a BTreeMap for abunds; it could use only a BTreeMap, since the keys are already the mins).
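To make the layout concrete, here is a minimal sketch of what a BTree-backed MinHash could look like; the struct and method names are illustrative assumptions, not sourmash's actual types:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Illustrative sketch only -- names and fields are assumptions,
// not the actual sourmash implementation.
struct BTreeMinHash {
    max_hash: u64,              // derived from the scaled value; 0 = keep everything
    mins: BTreeSet<u64>,        // kept hashes, always in sorted order
    abunds: BTreeMap<u64, u64>, // hash -> abundance
}

impl BTreeMinHash {
    fn add_hash(&mut self, hash: u64) {
        if self.max_hash > 0 && hash > self.max_hash {
            return; // above the scaled threshold, not kept
        }
        // O(log n) insertion: no shifting of a sorted Vec's tail,
        // and no large reallocations as the sketch grows.
        self.mins.insert(hash);
        *self.abunds.entry(hash).or_insert(0) += 1;
    }
}
```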

Anecdotally, I used this to calculate signatures for some long-read samples from CAMI 2, and it took 15 minutes, versus 2+ days with the current method (which still hadn't finished when I stopped it).

BUT! All the other operations (merge, similarity, etc.) are SLOWER; only insertion ends up being faster. That's why I'm calling it "compute-optimized": in the other cases it's better to use the current one. (Pending: an analysis of gather, which does rebuild the query minhash a lot...)
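The insertion asymmetry comes down to how the two layouts grow. A hedged, self-contained illustration of the two insertion strategies (not sourmash code):

```rust
use std::collections::BTreeSet;

// Sorted-Vec insertion: binary search finds the slot in O(log n),
// but Vec::insert then shifts every later element, so inserting a
// small hash into a large sketch is O(n) -- and the Vec reallocates
// as it grows, which is the cost described above.
fn vec_insert(mins: &mut Vec<u64>, hash: u64) {
    if let Err(pos) = mins.binary_search(&hash) {
        mins.insert(pos, hash);
    }
}

// BTreeSet insertion: O(log n) with no tail shifting and no single
// large reallocation -- which is why insertion (and only insertion)
// gets faster with the BTree layout.
fn btree_insert(mins: &mut BTreeSet<u64>, hash: u64) {
    mins.insert(hash);
}
```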

Fixes #1010

TODO

  • Keep two MinHash impls, this one and the original (this PR is replacing the old one for now)
  • set up proptest using both impls and check that they give the same results (see the sketch after this list)
  • use this one in build_templates for compute
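
For the proptest item above, the oracle-style test could look roughly like this; `VecMinHash` and `BTreeMinHash` are hypothetical stand-ins for the two impls, and the constructors and `mins()` accessor (assumed to return the sorted hashes as a `Vec<u64>`) are assumptions:

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn both_impls_agree(hashes in proptest::collection::vec(any::<u64>(), 0..1000)) {
        // Hypothetical constructors: num = 0 and max_hash = u64::MAX keep all hashes.
        let mut vec_mh = VecMinHash::new(0, u64::MAX);
        let mut btree_mh = BTreeMinHash::new(0, u64::MAX);
        for &h in &hashes {
            vec_mh.add_hash(h);
            btree_mh.add_hash(h);
        }
        // Oracle check: the same input must produce the same sorted mins.
        prop_assert_eq!(vec_mh.mins(), btree_mh.mins());
    }
}
```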

Checklist

  • Is it mergeable?
  • make test: did it pass the tests?
  • make coverage: is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov (bot) commented Jun 24, 2020

Codecov Report

Merging #1045 into master will decrease coverage by 9.12%.
The diff coverage is 96.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1045      +/-   ##
==========================================
- Coverage   92.42%   83.30%   -9.13%     
==========================================
  Files          72       97      +25     
  Lines        5454     8749    +3295     
==========================================
+ Hits         5041     7288    +2247     
- Misses        413     1461    +1048     
| Flag | Coverage Δ |
|---|---|
| #rusttests | 68.19% <96.01%> (?) |

| Impacted Files | Coverage Δ |
|---|---|
| src/core/tests/test.rs | 100.00% <ø> (ø) |
| src/core/src/signature.rs | 43.00% <16.66%> (ø) |
| src/core/src/cmd.rs | 87.30% <66.66%> (ø) |
| src/core/tests/signature.rs | 95.83% <90.69%> (ø) |
| src/core/src/sketch/minhash.rs | 94.63% <97.56%> (ø) |
| src/core/src/index/sbt/mhbt.rs | 63.70% <100.00%> (ø) |
| src/core/src/sketch/nodegraph.rs | 84.19% <100.00%> (ø) |
| src/core/tests/minhash.rs | 99.52% <100.00%> (ø) |
| src/core/src/ffi/mod.rs | 0.00% <0.00%> (ø) |
| ... and 23 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@luizirber (Member, Author) commented Jun 25, 2020

Issues to punt from this PR:

@luizirber changed the title from "[WIP] compute-optimized MinHash (for small scaled or large cardinalities)" to "[MRG] compute-optimized MinHash (for small scaled or large cardinalities)" on Jun 25, 2020
@luizirber (Member, Author)

This is ready for review @ctb @olgabot @bluegenes

It is still missing coverage on the Rust side (since most of those methods end up not being exposed to Python at all); I'll set them up as more oracle-based property tests, which will also raise the Vec-based MinHash coverage.

@ctb (Contributor) commented Jun 26, 2020

Huh. Those are a lot of issues to punt from this PR :).

This doesn't touch the Python API or command-line interface. I'm not sure how to review it because of that! I'm fine with merging it, I guess?

@luizirber (Member, Author)

> Huh. Those are a lot of issues to punt from this PR :).

I wanted to avoid making it even more massive with a bunch of random changes... And the refactors are easier later, because they will have a baseline that already works.

(And the add_sequence/add_protein item was more a reminder for something I noticed while doing this; it's an orthogonal PR to this one.)

> This doesn't touch the Python API or command-line interface. I'm not sure how to review it because of that! I'm fine with merging it, I guess?

I think this is the relevant info: https://github.com/luizirber/sourmash_resources/blob/03ca7cea8df4640f83fcfa3359ce0be9ce0abab1/README.md#compute

Performance didn't change for the regular use case, and it improved a lot for small scaled values or large cardinalities. Memory consumption can be lowered further (by using only a BTreeMap instead of BTreeSet+BTreeMap in the abundance case; see the sketch below), but that can come later, since it doesn't change the public API.
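A minimal sketch of that BTreeMap-only variant, continuing the hypothetical struct from the PR description: the sorted keys are the mins, so the separate BTreeSet becomes redundant when abundances are tracked.

```rust
use std::collections::BTreeMap;

// Illustrative sketch, not the actual sourmash code: one BTreeMap
// replaces the BTreeSet + BTreeMap pair in the abundance case.
struct CompactMinHash {
    max_hash: u64,
    abunds: BTreeMap<u64, u64>, // key = min hash, value = abundance
}

impl CompactMinHash {
    fn add_hash(&mut self, hash: u64) {
        if self.max_hash > 0 && hash > self.max_hash {
            return;
        }
        *self.abunds.entry(hash).or_insert(0) += 1;
    }

    // Iterating the keys yields the mins in sorted order, matching
    // what the separate BTreeSet provided before.
    fn mins(&self) -> impl Iterator<Item = u64> + '_ {
        self.abunds.keys().copied()
    }
}
```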

I'll bring up the coverage and test more on the Rust side, and then merge. And probably cut 3.3.2 before Monday?

@ctb (Contributor) commented Jun 26, 2020

Sounds good to me. There are a few PRs I'd like to see make it into a release, but I guess we can always cut another one soon after :)
