Clusty is slow to read long inputs (with named sequences) #2
Hello! We managed to run Clusty on distance tables of tens of gigabytes, so there seems to be some issue that made Clusty hang on your dataset. How large is your data file and how many distances does it contain? Could you please provide me with at least part of it? Best,
I don't have the original input anymore, but a filtered version (which Clusty is also taking a long time to read) is ~34 GB with 401,294,724 lines (not counting the header).
Hello! Thank you for providing me with the data. The reason Clusty is so slow in your application is that your input contains named sequences, while our Vclust pipeline uses numeric identifiers (in fact, your identifiers are 128-bit hexadecimal numbers, but Clusty treats them as strings). Sequence names have to be hashed, which introduces significant overhead. Since we didn't optimize Clusty for inputs of this kind, there is room for improvement. I'll let you know once the fix is done. Best,
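For illustration, the overhead described above can be sketched as follows. This is not Clusty's actual code; it is a minimal Python example of the general technique of mapping string sequence names to dense integer ids through a hash table, which is the step numeric identifiers skip entirely:

```python
def map_names_to_ids(pairs):
    """Map string sequence names to dense integer ids.

    `pairs` is an iterable of (name_a, name_b) edge endpoints from a
    distance table. Returns the edge list as integer pairs plus the
    name -> id lookup table.
    """
    ids = {}    # hash table: sequence name -> dense integer id
    edges = []
    for a, b in pairs:
        # Each endpoint costs a string hash plus a table lookup; over
        # hundreds of millions of edges this parsing-side work adds up,
        # whereas numeric ids can be used directly as indices.
        ia = ids.setdefault(a, len(ids))
        ib = ids.setdefault(b, len(ids))
        edges.append((ia, ib))
    return edges, ids
```

With numeric identifiers the same edge list needs no lookup table at all, which is why the string-named input hits a code path with much higher per-line cost.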
Hello! Best,
It is much faster now. Thank you @agudys! The results, however, don't match those that I get with
I generated a large similarity table from ~1 million genomes using sourmash's branchwater and tried to cluster it with Clusty. Clusty hadn't finished reading the input after 6 hours, while pyLeiden finished reading everything within ~30 min.
Is there a reason Clusty is taking so much time to read the input?