[MRG] compute-optimized MinHash (for small scaled or large cardinalities) #1045
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1045 +/- ##
==========================================
- Coverage 92.42% 83.30% -9.13%
==========================================
Files 72 97 +25
Lines 5454 8749 +3295
==========================================
+ Hits 5041 7288 +2247
- Misses 413 1461 +1048
Continue to review full report at Codecov.
Issues to punt from this PR:
This is ready for review @ctb @olgabot @bluegenes. It is still missing coverage on the Rust side (since most of those methods end up not being exposed to Python at all); I'll set them up as more oracle-based property tests, which will also raise the Vec-based MinHash coverage.
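A minimal sketch of what such oracle-based checking could look like (hypothetical, simplified types; the real sourmash `KmerMinHash` API differs): run a Vec-based implementation as the oracle next to the BTreeSet-based candidate and assert they keep the same mins for any input stream of hashes.

```rust
use std::collections::BTreeSet;

const NUM: usize = 4; // keep at most 4 mins, like a tiny num-bounded MinHash

// Oracle: a Vec-based implementation (sorted vector of mins).
fn vec_add_hash(mins: &mut Vec<u64>, hash: u64) {
    if let Err(pos) = mins.binary_search(&hash) {
        if mins.len() < NUM || hash < *mins.last().unwrap() {
            mins.insert(pos, hash);
            mins.truncate(NUM);
        }
    }
}

// Candidate: a BTreeSet-based implementation, as in this PR (sketch).
fn btree_add_hash(mins: &mut BTreeSet<u64>, hash: u64) {
    mins.insert(hash);
    while mins.len() > NUM {
        let max = *mins.iter().next_back().unwrap();
        mins.remove(&max);
    }
}

// Feed the same hashes to both impls and return both results.
fn both_impls(hashes: &[u64]) -> (Vec<u64>, Vec<u64>) {
    let mut v = Vec::new();
    let mut b = BTreeSet::new();
    for &h in hashes {
        vec_add_hash(&mut v, h);
        btree_add_hash(&mut b, h);
    }
    (v, b.into_iter().collect())
}

fn main() {
    // In a real proptest, `hashes` would be a generated strategy;
    // here a fixed input stands in for one sampled case.
    let hashes = [42, 7, 7, 99, 3, 1000, 5, 12];
    let (vec_mins, btree_mins) = both_impls(&hashes);
    assert_eq!(vec_mins, btree_mins); // → both keep [3, 5, 7, 12]
    println!("{:?}", vec_mins);
}
```

With a `proptest` strategy generating the hash stream, the same comparison exercises both implementations across many random inputs, which is what raises the Vec-based coverage as a side effect.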
Huh. Those are a lot of issues to punt from this PR :). This doesn't touch the Python API or command-line interface, so I'm not sure how to review it! I'm fine with merging it, I guess?
I wanted to avoid making it even more massive with a bunch of unrelated changes... And the refactors are easier later, because they will have a baseline that already works. (And the
I think this is the relevant info: https://github.com/luizirber/sourmash_resources/blob/03ca7cea8df4640f83fcfa3359ce0be9ce0abab1/README.md#compute Performance didn't change for the regular use case, and it improved a lot for small scaled values or large cardinalities. The memory consumption can be lowered (by using only a `BTreeMap`).

I'll bring up the coverage and test more on the Rust side, and then merge. And probably cut a release.
Sounds good to me. There are a few PRs I'd like to get into a release, but I guess we can always cut another one soon after :)
As discussed in #1010, for small scaled values or datasets with large cardinality the current implementation starts spending too much time reallocating the internal vector used for keeping `mins` and `abundances`. This PR is a first try at creating a compute-optimized MinHash that solves that problem, using `BTree` structures in Rust (a `BTreeSet` for mins and a `BTreeMap` for abunds, though it could use only a `BTreeMap`, since the keys are the mins already).

Anecdotally, I used this to calculate signatures for some long-read samples from CAMI 2, and it took 15 minutes instead of 2+ days with the current method (which still hadn't finished when I stopped it).

BUT! All the other operations (merge, similarity, etc.) are SLOWER; only insertion ends up being faster. That's why I'm calling it "compute-optimized": in the other cases it's better to use the current one. (Pending: analysis of `gather`, which does rebuild the query minhash a lot...)

Fixes #1010
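As a rough sketch of the idea (a hypothetical simplification, not the actual sourmash Rust code): with a scaled threshold, every insertion into a `BTreeSet` is an O(log n) tree insert, with no reallocation or element shifting, which is exactly where the Vec-based version loses time for small scaled / large cardinality inputs.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified sketch of a "compute-optimized" MinHash:
// a BTreeSet for the mins and a BTreeMap for abundances. As the PR
// description notes, a single BTreeMap<u64, u64> would suffice,
// since its keys are exactly the mins.
struct ComputeMinHash {
    max_hash: u64, // derived from `scaled`: keep hashes below this bound
    mins: BTreeSet<u64>,
    abunds: BTreeMap<u64, u64>,
}

impl ComputeMinHash {
    fn new(scaled: u64) -> Self {
        ComputeMinHash {
            // Approximation of the scaled -> max_hash relationship.
            max_hash: u64::MAX / scaled,
            mins: BTreeSet::new(),
            abunds: BTreeMap::new(),
        }
    }

    fn add_hash(&mut self, hash: u64) {
        if hash <= self.max_hash {
            // O(log n) ordered insert: no Vec reallocation, no shifting.
            self.mins.insert(hash);
            *self.abunds.entry(hash).or_insert(0) += 1;
        }
    }

    fn size(&self) -> usize {
        self.mins.len()
    }
}

fn main() {
    let mut mh = ComputeMinHash::new(2); // small scaled: keeps ~half of all hashes
    for h in [1u64, 1, 5, u64::MAX - 3] {
        mh.add_hash(h);
    }
    assert_eq!(mh.size(), 2); // 1 and 5 kept; u64::MAX - 3 is above the bound
    assert_eq!(mh.abunds[&1], 2);
    println!("kept {} mins", mh.size());
}
```

The trade-off noted above follows from the same structure: set operations like merge and similarity iterate a pointer-chasing tree instead of a contiguous sorted vector, so they run slower than the Vec-based version.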
TODO

- [ ] `proptest` using both impls, see if they give the same results
- [ ] `build_templates` for compute

Checklist
- [ ] `make test` Did it pass the tests?
- [ ] `make coverage` Is the new code covered?
- [ ] Did it change the command-line interface? Only additions are allowed
  without a major version increment. Changing file formats also requires a
  major version number increment.
- [ ] Was a spellchecker run on the source code and documentation after
  changes were made?