Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: avoid clones by using new Signature::try_into() -> KmerMinHash #471

Merged
merged 138 commits into from
Oct 15, 2024

Conversation

ctb
Copy link
Collaborator

@ctb ctb commented Oct 13, 2024

NOTE: PR into #430. Includes #467.

gather comparison

Comparing latest MultiCollection benchmarks from #430 to this PR, for SRR1976948, we see:

prefix s (#430) s (this PR) max_rss (#430) max_rss (this PR)
fastmultigather_rocksdb 151.352 160.3 590.32 487.6
fastgather 166.86 118.5 8287.86 7786.2
fastmultigather 465.033 387.1 25236.7 20327.2

manysearch comparison

Comparing to manysearch benchmarking results in #463, we see:

plugin version time memory
v0.8.6 814s 18.3 GB
v0.9.0 1332s 27.1 GB
v0.9.1 1361s 22.4 GB
v0.9.5 1534s 21.1 GB
v0.9.6 17523s 27.2 GB
unreleased w/#430 1428.0 18.6 GB
unreleased w/430/471 1096.7 16.5 GB
unreleased w/430/471* 559.7 38.3 GB

* this run used sig.zip files and manifest CSVs in the benchmarking.

So this is a major improvement in both time and memory over, really, everything since v0.8.6!

The last line shows that there is now a major benefit to using lists of .sig.zip files and/or manifest CSVs that point at zip files: basically, only the relevant sketch (k=21/31/51 or whatever) is being loaded, and there's no double-loading to generate the manifest (as introduced in 0.9.0). This was supposed to be one of the major benefits of #430 so I'm very happy about these results showing that's the case!! The extra memory usage is presumably because so much more of the time is being spent in calculations vs loading.

ctb added 30 commits August 17, 2024 10:14
#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc
@ctb ctb changed the title WIP: avoid clones by using new Signature::try_into() -> KmerMinHash MRG: avoid clones by using new Signature::try_into() -> KmerMinHash Oct 13, 2024
@ctb
Copy link
Collaborator Author

ctb commented Oct 13, 2024

Ready for review! I'd like to do the merge tho, since it should be done after #430 and (imho) #467.

ctb added a commit to sourmash-bio/sourmash that referenced this pull request Oct 15, 2024
…ing (#3352)

This PR builds on the refactoring in #3342 to do less downsampling and
also avoids doing intersections twice (per #3196).

Benchmarks in
sourmash-bio/sourmash_plugin_branchwater#471 are
pretty astonishing...

Fixes #3196

---------

Co-authored-by: Luiz Irber <[email protected]>
Base automatically changed from ctb_misc2 to main October 15, 2024 21:50
@ctb ctb merged commit a78e145 into main Oct 15, 2024
1 check passed
@ctb ctb deleted the avoid_clones branch October 15, 2024 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants