-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Difference between searching from reads and assembly #2750
Comments
hey @jodyphelan thanks for the commands!! my guess here is that the assembly is removing a lot of sequence, probably by collapsing strain variation. This is a well-known problem that we are working to measure in an ongoing project. This does seem like an extreme case tho! The simplest way to confirm is to look at the k-mer containment values directly. They should mirror what you see with ANI, but stronger, because k-mer containment scales exponentially with ANI. An alternative/addition would be to do mapping - the 99% reads should map just fine to the reference. @bluegenes any thoughts? super cool stuff - if you nail down any details I'd love to hear about them! |
Hi @ctb, Thanks for your answer, super informative. Based on that I tested to see if you track the abundance and then filter out hashes with low abundance and it seems to push up the ANI to 99%. I was wondering if perhaps filtering of low-abundance hashes would be a useful feature to build in? No worries if it isn't a priority, I can use the script for now. Here if a working example building on the previous comment. Here is the python script to perform the filtering of the signature (using a min value of 5): import json
import sys
input = sys.argv[1]
output = sys.argv[2]
data = json.load(open(input))
num_mins = len(data[0]['signatures'][0]['mins'])
mins = [data[0]['signatures'][0]['mins'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]
abunds = [data[0]['signatures'][0]['abundances'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]
data[0]['signatures'][0]['mins'] = mins
data[0]['signatures'][0]['abundances'] = abunds
json.dump(data, open(output, 'w')) Then I ran the following sketch command with a lower scaling value (as some of the hashes will be removed later).
Performed the filtering with the script above. This brings the number of hashes down from 517328 to 61343.
Then running search which gives an ANI estimate of 99%
|
more later, but see: |
Oh wow, I need to learn how to read the manual! |
nah, no worries - we've added a lot of functionality over the years and it's easy to lose track even for us 😆 |
haha ok! Just for reference, I'm planning to use this as a method to quickly ascertain what species of Mycobacterium we have before we go down the mapping/variant calling and all that and it seems like it does it perfectly, so thanks again for a great tool! |
just to put a cap on this - I made a note to revisit based on the last comment, but now upon re-reading I see:
the straightforward interpretation is that assembly is indeed assembling the high abundance sequences, so filtering out low abundance k-mers makes the reads look like the assembly. I had misread the initial question to show that assembly had an ANI of 93%, and that made me wonder. but all seems good now! Note that I would also like to take this time to suggest that you use gather instead of Last but not least you might be interested in this discussion about k-mer trimming: #2122 - no hurry or anything, but your feedback would be very welcome :) |
I'm looking into using sourmash to find the max ANI of a new sample to existing sequences. I can do this either straight from the reads or assembling first and then searching using the assembly.
I noticed the ANI drops considerably (outside of what might be considered the same species) from 99% using the assembly to 93% using the reads. Is this expected?
Here is an example for reference:
The text was updated successfully, but these errors were encountered: