Difference between searching from reads and assembly #2750

jodyphelan · 2023-09-06T09:09:56Z

I'm looking into using sourmash to find the max ANI of a new sample to existing sequences. I can do this either straight from the reads or assembling first and then searching using the assembly.

I noticed the ANI drops considerably (outside of what might be considered the same species) from 99% using the assembly to 93% using the reads. Is this expected?

Here is an example for reference:

# download files
wget wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR211/000/SRR21133400/SRR21133400_*.fastq.gz
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001373395.1/download?include_annotation_type=GENOME_FASTA&filename=GCF_001373395.1.zip" -H "Accept: application/zip"
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_022374895.2/download?include_annotation_type=GENOME_FASTA&filename=GCF_022374895.2.zip" -H "Accept: application/zip"
unzip -o GCF_001373395.1.zip
unzip -o GCF_022374895.2.zip

# create sketches
sourmash sketch dna --merge reads -o reads.sig SRR21133400_*
sourmash sketch dna -o GCF_001373395.1.sig ncbi_dataset/data/GCF_001373395.1/GCF_001373395.1_PRJEB8430_assembly_1_genomic.fna
sourmash sketch dna -o GCF_022374895.2.sig ncbi_dataset/data/GCF_022374895.2/GCF_022374895.2_ASM2237489v2_genomic.fna

# perform searches
sourmash search reads.sig GCF_001373395.1.sig --containment -o reads.csv --estimate-ani
sourmash search GCF_022374895.2.sig GCF_001373395.1.sig --containment -o asm.csv --estimate-ani
head *.csv

The text was updated successfully, but these errors were encountered:

ctb · 2023-09-06T13:07:29Z

hey @jodyphelan thanks for the commands!!

my guess here is that the assembly is removing a lot of sequence, probably by collapsing strain variation. This is a well-known problem that we are working to measure in an ongoing project. This does seem like an extreme case tho!

The simplest way to confirm is to look at the k-mer containment values directly. They should mirror what you see with ANI, but stronger, because k-mer containment scales exponentially with ANI.

An alternative/addition would be to do mapping - the 99% reads should map just fine to the reference.

@bluegenes any thoughts?

super cool stuff - if you nail down any details I'd love to hear about them!

jodyphelan · 2023-09-06T15:48:56Z

Hi @ctb,

Thanks for your answer, super informative. Based on that I tested to see if you track the abundance and then filter out hashes with low abundance and it seems to push up the ANI to 99%. I was wondering if perhaps filtering of low-abundance hashes would be a useful feature to build in? No worries if it isn't a priority, I can use the script for now.

Here if a working example building on the previous comment. Here is the python script to perform the filtering of the signature (using a min value of 5):

import json
import sys

input = sys.argv[1]
output = sys.argv[2]

data = json.load(open(input))

num_mins = len(data[0]['signatures'][0]['mins'])

mins = [data[0]['signatures'][0]['mins'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]
abunds = [data[0]['signatures'][0]['abundances'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]

data[0]['signatures'][0]['mins'] = mins
data[0]['signatures'][0]['abundances'] = abunds

json.dump(data, open(output, 'w'))

Then I ran the following sketch command with a lower scaling value (as some of the hashes will be removed later).

sourmash sketch dna -p scaled=100,abund --merge reads -o reads.abund.sig SRR21133400_*

Performed the filtering with the script above. This brings the number of hashes down from 517328 to 61343.

python filter-sig.py reads.abund.sig reads.abund.filt.sig

# Check the difference 
sourmash sig describe reads.abund.sig
sourmash sig describe reads.abund.filt.sig

Then running search which gives an ANI estimate of 99%

sourmash search reads.abund.filt.sig GCF_001373395.1.sig --containment -o reads.filt.csv --estimate-ani --ignore-abundance
head *.csv

ctb · 2023-09-06T16:05:41Z

more later, but see: sourmash sig filter ;)

jodyphelan · 2023-09-06T16:22:21Z

Oh wow, I need to learn how to read the manual!

ctb · 2023-09-06T16:34:29Z

nah, no worries - we've added a lot of functionality over the years and it's easy to lose track even for us 😆

jodyphelan · 2023-09-06T16:48:42Z

haha ok!

Just for reference, I'm planning to use this as a method to quickly ascertain what species of Mycobacterium we have before we go down the mapping/variant calling and all that and it seems like it does it perfectly, so thanks again for a great tool!

ctb · 2023-09-17T15:07:47Z

just to put a cap on this - I made a note to revisit based on the last comment, but now upon re-reading I see:

assembly had ANI of 99% (based on sketching)
reads had estimated ANI of 93% (based on sketching)
filtering out low-abundance hashes from the sketches made the reads ANI match the assembly ANI

the straightforward interpretation is that assembly is indeed assembling the high abundance sequences, so filtering out low abundance k-mers makes the reads look like the assembly.

I had misread the initial question to show that assembly had an ANI of 93%, and that made me wonder. but all seems good now!

Note that search is not that informative compared to prefetch and prefetch does exactly what you want here - search by containment :).

I would also like to take this time to suggest that you use gather instead of search --containment, which should give you the "right" species of mycobacterium with very high certainty and a lot less fuss. If you try it out and get different results I'd be VERY interested to hear about it.

Last but not least you might be interested in this discussion about k-mer trimming: #2122 - no hurry or anything, but your feedback would be very welcome :)

jodyphelan closed this as completed Sep 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Difference between searching from reads and assembly #2750

Difference between searching from reads and assembly #2750

jodyphelan commented Sep 6, 2023

ctb commented Sep 6, 2023

jodyphelan commented Sep 6, 2023 •

edited

Loading

ctb commented Sep 6, 2023

jodyphelan commented Sep 6, 2023

ctb commented Sep 6, 2023

jodyphelan commented Sep 6, 2023

ctb commented Sep 17, 2023

Difference between searching from reads and assembly #2750

Difference between searching from reads and assembly #2750

Comments

jodyphelan commented Sep 6, 2023

ctb commented Sep 6, 2023

jodyphelan commented Sep 6, 2023 • edited Loading

ctb commented Sep 6, 2023

jodyphelan commented Sep 6, 2023

ctb commented Sep 6, 2023

jodyphelan commented Sep 6, 2023

ctb commented Sep 17, 2023

jodyphelan commented Sep 6, 2023 •

edited

Loading