Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difference between searching from reads and assembly #2750

Closed
jodyphelan opened this issue Sep 6, 2023 · 7 comments
Closed

Difference between searching from reads and assembly #2750

jodyphelan opened this issue Sep 6, 2023 · 7 comments

Comments

@jodyphelan
Copy link

I'm looking into using sourmash to find the max ANI of a new sample to existing sequences. I can do this either straight from the reads or assembling first and then searching using the assembly.

I noticed the ANI drops considerably (outside of what might be considered the same species) from 99% using the assembly to 93% using the reads. Is this expected?

Here is an example for reference:

# download files
wget wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR211/000/SRR21133400/SRR21133400_*.fastq.gz
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001373395.1/download?include_annotation_type=GENOME_FASTA&filename=GCF_001373395.1.zip" -H "Accept: application/zip"
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_022374895.2/download?include_annotation_type=GENOME_FASTA&filename=GCF_022374895.2.zip" -H "Accept: application/zip"
unzip -o GCF_001373395.1.zip
unzip -o GCF_022374895.2.zip

# create sketches
sourmash sketch dna --merge reads -o reads.sig SRR21133400_*
sourmash sketch dna -o GCF_001373395.1.sig ncbi_dataset/data/GCF_001373395.1/GCF_001373395.1_PRJEB8430_assembly_1_genomic.fna
sourmash sketch dna -o GCF_022374895.2.sig ncbi_dataset/data/GCF_022374895.2/GCF_022374895.2_ASM2237489v2_genomic.fna

# perform searches
sourmash search reads.sig GCF_001373395.1.sig --containment -o reads.csv --estimate-ani
sourmash search GCF_022374895.2.sig GCF_001373395.1.sig --containment -o asm.csv --estimate-ani
head *.csv
@ctb
Copy link
Contributor

ctb commented Sep 6, 2023

hey @jodyphelan thanks for the commands!!

my guess here is that the assembly is removing a lot of sequence, probably by collapsing strain variation. This is a well-known problem that we are working to measure in an ongoing project. This does seem like an extreme case tho!

The simplest way to confirm is to look at the k-mer containment values directly. They should mirror what you see with ANI, but stronger, because k-mer containment scales exponentially with ANI.

An alternative/addition would be to do mapping - the 99% reads should map just fine to the reference.

@bluegenes any thoughts?

super cool stuff - if you nail down any details I'd love to hear about them!

@jodyphelan
Copy link
Author

jodyphelan commented Sep 6, 2023

Hi @ctb,

Thanks for your answer, super informative. Based on that I tested to see if you track the abundance and then filter out hashes with low abundance and it seems to push up the ANI to 99%. I was wondering if perhaps filtering of low-abundance hashes would be a useful feature to build in? No worries if it isn't a priority, I can use the script for now.

Here if a working example building on the previous comment. Here is the python script to perform the filtering of the signature (using a min value of 5):

import json
import sys

input = sys.argv[1]
output = sys.argv[2]

data = json.load(open(input))

num_mins = len(data[0]['signatures'][0]['mins'])

mins = [data[0]['signatures'][0]['mins'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]
abunds = [data[0]['signatures'][0]['abundances'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]

data[0]['signatures'][0]['mins'] = mins
data[0]['signatures'][0]['abundances'] = abunds

json.dump(data, open(output, 'w'))

Then I ran the following sketch command with a lower scaling value (as some of the hashes will be removed later).

sourmash sketch dna -p scaled=100,abund --merge reads -o reads.abund.sig SRR21133400_*

Performed the filtering with the script above. This brings the number of hashes down from 517328 to 61343.

python filter-sig.py reads.abund.sig reads.abund.filt.sig

# Check the difference 
sourmash sig describe reads.abund.sig
sourmash sig describe reads.abund.filt.sig

Then running search which gives an ANI estimate of 99%

sourmash search reads.abund.filt.sig GCF_001373395.1.sig --containment -o reads.filt.csv --estimate-ani --ignore-abundance
head *.csv

@ctb
Copy link
Contributor

ctb commented Sep 6, 2023

more later, but see: sourmash sig filter ;)

@jodyphelan
Copy link
Author

Oh wow, I need to learn how to read the manual!

@ctb
Copy link
Contributor

ctb commented Sep 6, 2023

nah, no worries - we've added a lot of functionality over the years and it's easy to lose track even for us 😆

@jodyphelan
Copy link
Author

haha ok!

Just for reference, I'm planning to use this as a method to quickly ascertain what species of Mycobacterium we have before we go down the mapping/variant calling and all that and it seems like it does it perfectly, so thanks again for a great tool!

@ctb
Copy link
Contributor

ctb commented Sep 17, 2023

just to put a cap on this - I made a note to revisit based on the last comment, but now upon re-reading I see:

  • assembly had ANI of 99% (based on sketching)
  • reads had estimated ANI of 93% (based on sketching)
  • filtering out low-abundance hashes from the sketches made the reads ANI match the assembly ANI

the straightforward interpretation is that assembly is indeed assembling the high abundance sequences, so filtering out low abundance k-mers makes the reads look like the assembly.

I had misread the initial question to show that assembly had an ANI of 93%, and that made me wonder. but all seems good now!

Note that search is not that informative compared to prefetch and prefetch does exactly what you want here - search by containment :).

I would also like to take this time to suggest that you use gather instead of search --containment, which should give you the "right" species of mycobacterium with very high certainty and a lot less fuss. If you try it out and get different results I'd be VERY interested to hear about it.

Last but not least you might be interested in this discussion about k-mer trimming: #2122 - no hurry or anything, but your feedback would be very welcome :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants