Improved classification time with KMC #15

dkoslicki · 2020-02-06T20:54:33Z

When running StreamingQueryDNADatabase.py, in reality, we need only the K-mers in the sample that exist in the training database sketches. As such, it's possible to:

1. Dump all the training database sketch k-mers using KMC (like it's done here)
2. Use KMC to count the k-mers in the sample
3. Intersect these with the k-mers in the training database sketches and dump these to a file
4. Reformat these dumped k-mers into a FASTA-looking file
5. Feed that into StreamingQueryDNADatabase.py

Steps 2-5 are basically what's done here, since I suggested this approach to Nathan LaPierre, but I never got around to implementing it in CMash.
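To make the idea concrete, here is a minimal pure-Python sketch of the intersect-and-reformat steps. It does not call KMC; plain Python sets stand in for the KMC databases, reverse complements are ignored, and the function names (`kmers`, `prefilter_sample`, `to_fasta`) and the toy data are hypothetical, not part of CMash.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter_sample(sample_seqs, sketch_kmers, k):
    """Keep only the sample k-mers that also occur in the training sketches."""
    sample_kmers = set()
    for seq in sample_seqs:
        sample_kmers |= kmers(seq, k)
    return sorted(sample_kmers & set(sketch_kmers))


def to_fasta(kmer_list):
    """Format k-mers as a FASTA-looking string, one record per k-mer."""
    return "".join(f">seq{i}\n{kmer}\n" for i, kmer in enumerate(kmer_list))


# Toy data (assumption: not taken from any real training database).
sketch = {"ACGT", "CGTA", "TTTT"}   # training sketch k-mers
reads = ["ACGTAC", "GGGG"]          # sample reads
shared = prefilter_sample(reads, sketch, k=4)
fasta_text = to_fasta(shared)
print(fasta_text)
```

The resulting FASTA-looking text is what would be fed to the query script in place of the full sample.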

dkoslicki · 2020-02-07T04:24:59Z

Could be nice to write this as an optional script that will be run before the StreamingQueryDNADatabase.py script and then compare runtime performance with/without this pre-processing (as you’ll also not need the Bloom filter pre-filter). You’ll lose the ability to keep track of k-mer counts (which is the focus of #21), but it would be for a different use case (eg. Metalign). @IsaacT1123 This might be a good first thing to cut your teeth on and could be a component of the eventual CMash publication mentioned in #20 .

IsaacT1123 · 2020-02-07T12:34:53Z

I’d be happy to work on this enhancement after getting CMash installed and creating a retraining script.

dkoslicki · 2020-02-07T21:01:30Z

Excellent! Go ahead and make a new branch (named something like KMC or Issue15 or the like), and I'll assign this issue to you

dkoslicki · 2020-03-18T22:38:03Z

@IsaacT1123 a friendly reminder to reference this issue in your commits! It's a great way to keep track of changes being made, and is also a visible history of your contributions to FOSS projects!

IsaacT1123 · 2020-03-19T02:34:51Z

I’ll be sure to do so from now on. Thank you for reminding me.

dkoslicki · 2020-03-31T19:01:20Z

Still in the process of diagnosing the issue. Will need to return to it later @dkoslicki

dkoslicki · 2020-03-31T21:25:43Z

Definitely an issue with the kmc_dump and then the intersection. Running:

/usr/bin/time python ${scriptsDir}/MakeStreamingDNADatabase.py filenames.txt TrainingDatabase.h5 -k 10
/usr/bin/time python ${scriptsDir}/StreamingQueryDNADatabase.py ${testOrganism} TrainingDatabase.h5 results.csv 10-10-1 --sensitive --intersect -c 0

and then checking:

grep -c '>' 10mers_intersection_dump.fa

you get 3 10-mers, which is totally not accurate since the testOrganism itself has way more than 3 10-mers.

dkoslicki · 2020-03-31T21:42:54Z

Indeed, it might be the kmc_tools simple intersect that's the problem since:

comm -1 -2 <(kmc_dump reads_10mers_dump /dev/fd/1 | cut -f1 | sort) <(grep -v '>' TrainingDatabase_dump.fa | sort) | wc -l

gives 689. I.e. if you dump the 10mers from the reads, dump the training database 10-mers, and count the ones in common, you get much more than 3...
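The same sanity check can be replicated in Python. The following is a minimal sketch, assuming the read dump is tab-separated text with the k-mer in the first column (as the `cut -f1` above suggests) and the training dump is a FASTA-looking file. The helper names and the toy file contents are hypothetical.

```python
import os
import tempfile


def read_dump_kmers(path):
    """Read k-mers from a tab-separated dump (k-mer, then count, per line)."""
    with open(path) as fh:
        return {line.split("\t")[0].strip() for line in fh if line.strip()}


def read_fasta_kmers(path):
    """Read sequences from a FASTA-looking file, skipping '>' header lines."""
    with open(path) as fh:
        return {line.strip() for line in fh
                if line.strip() and not line.startswith(">")}


def count_common(dump_path, fasta_path):
    """Count k-mers present in both files."""
    return len(read_dump_kmers(dump_path) & read_fasta_kmers(fasta_path))


# Toy demonstration with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    dump_path = os.path.join(tmp, "reads_dump.txt")
    fasta_path = os.path.join(tmp, "training_dump.fa")
    with open(dump_path, "w") as fh:
        fh.write("AAAC\t3\nACGT\t1\nTTTT\t7\n")
    with open(fasta_path, "w") as fh:
        fh.write(">seq0\nACGT\n>seq1\nGGGG\n>seq2\nTTTT\n")
    n_common = count_common(dump_path, fasta_path)

print(n_common)
```

If this count disagrees with the number of records in the intersection dump, the problem lies in how the intersection was produced rather than in the inputs.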

dkoslicki · 2020-04-01T22:14:24Z

@IsaacT1123 So the -fa to -fm in Intersect.count_training_kmers() fixed one issue, so now tests work if you use a single k-mer size, but still don't pass if you use multiple k-mer sizes...

dkoslicki · 2020-04-02T19:24:16Z

@IsaacT1123 Finally figured out the issue, and it's "obvious" now that I see what happens: When intersecting the reads 21-mers with the training database 21-mers, you miss out on some k-mers with k<21 that are in the reads (those that aren't prefixes of some database 21-mer). Eg. the query file has the 11-mer "TGCCCTGTGGC" in it, but since this isn't a prefix of a 21-mer in the training database, it isn't a prefix of any 21-mer included in the 21mers_intersection_dump.fa file. Hence these smaller k-mer sizes (k<21) are undercounted.
I.e., the KMC prefilter at this time should only be used when the training database is constructed with a k-mer size of K and StreamingQueryDNADatabase.py is called with the k-range `K-K-1`.
This explains why the above comment noted that things work for a single k-mer size.

I will need to think about if the KMC approach will work for multiple k-mer sizes. Thankfully, applications like Metalign (that only use a single k-mer size), are unaffected by this issue.
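A toy illustration of one way this undercount can arise, under the assumption that shorter k-mers are matched as prefixes of database K-mers. The sequences and sizes below (K=6, k=3) are made up for readability and the variable names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K, small_k = 6, 3
database = "ACGTACGG"     # toy training sequence
query = "TTACGTACCA"      # toy sample

db_K = kmers(database, K)
# What the prefilter keeps: query K-mers that are also database K-mers.
kept_K = kmers(query, K) & db_K

# Short k-mers still visible after prefiltering: prefixes of the kept K-mers.
recovered = {kmer[:small_k] for kmer in kept_K}

# Short k-mers that would match without prefiltering: query k-mers that are
# prefixes of some database K-mer.
db_prefixes = {kmer[:small_k] for kmer in db_K}
expected = kmers(query, small_k) & db_prefixes

missed = expected - recovered
print(sorted(missed))
```

Here only one query 6-mer survives the intersection, so two of the three 3-mers that match database prefixes are never seen by the query step.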

luizirber · 2020-04-02T20:03:04Z

> i.e. The KMC prefilter at this time should only be used when the training database is constructed with a k-mer size of K and the StreamingQueryDNADatabase.py is called with the k-range of `K-K-1`.

I was using both 21-51-10 and K-K-1 in my tests and was seeing ~40% differences, I will stick with K-K-1 for the near future.

(even so, if you look at the table in this section there are some datasets with big differences to the ground truth...)

dkoslicki · 2020-04-06T19:33:06Z

@luizirber

> > i.e. The KMC prefilter at this time should only be used when the training database is constructed with a k-mer size of K and the StreamingQueryDNADatabase.py is called with the k-range of `K-K-1`.
>
> I was using both 21-51-10 and K-K-1 in my tests and was seeing ~40% differences, I will stick with K-K-1 for the near future.
>
> (even so, if you look at the table in this section there are some datasets with big differences to the ground truth...)

Please do note that it is indeed expected that when using StreamingQueryDNADatabase.py <snip> bot-top-diff, the containment index estimates for k-mer sizes bot <= k < top will be off (sometimes significantly so), while k-mer size k=top will be at least as accurate as the manuscript bloom-filter approach.

Recall that last year when I presented this approach, I mentioned how when you take the prefix of a k-mer, it is not a truly random sample, but a biased sample. The theory has been worked out for how much this bias affects the containment index, but unfortunately the theory says that the bias factor is data dependent. Hence why @ShaopengLiu1 is working on #20 to test this on realistic data sets. He too is seeing differences of roughly 10-30%, but it completely depends on what the actual containment index is (e.g., a very small or very high true containment index means k-mer sizes with bot <= k < top are better estimates). Hence also the proviso in the (very out of date) Readme.md. And hence also why this streaming, ternary search tree, multi-kmer size approach has not been published yet (only the non-streaming, bloom-filter approach). Details are still being worked out for the intermediate k-mer sizes bot <= k < top.

Since Sourmash is estimating a single k-mer size containment index, if you want an "apples-to-apples" comparison, you should indeed be using K-K-1 only when using StreamingQueryDNADatabase.py.
Also, keep in mind that while Sourmash is using scaled hashes (iirc), StreamingQueryDNADatabase.py results will only be as accurate as the number of hashes you chose via -n in MakeStreamingDNADatabase.py. The default value of n=500 is quite small, and I would recommend always using as large of an n as you can tolerate (resource-wise). Eg. Metalign uses -n 1000 or -n 2000. As indicated in equation (2.7) (which is a lower bound on the streaming, multi-kmer size approach when k-mer size = top) the containment index accuracy exponentially improves with increasing n (note that equation uses k instead of n, but it's still up there in the exponent).
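As a rough illustration of why the sketch size `n` matters, here is a simplified containment estimator: take the `n` reference k-mers with the smallest hash values and report the fraction found in the sample. This is only a sketch under stated assumptions. It uses exact set membership instead of a bloom filter or ternary search tree, SHA-1 as a stand-in hash, and hypothetical names (`h`, `containment_estimate`); it is not CMash's implementation.

```python
import hashlib


def h(kmer):
    """Deterministic 64-bit hash of a k-mer (stand-in for a real hash)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment_estimate(ref_kmers, sample_kmers, n):
    """Estimate |ref & sample| / |ref| from the n smallest-hash ref k-mers."""
    sketch = sorted(ref_kmers, key=h)[:n]
    hits = sum(1 for kmer in sketch if kmer in sample_kmers)
    return hits / len(sketch)


# Toy data (assumption: made up for illustration).
ref = kmers("ACGTACGGTTCAGGCATTAC", 4)
sample = kmers("ACGTACGGTTCA", 4) | {"NNNN"}

true_c = len(ref & sample) / len(ref)
for n in (2, 5, len(ref)):
    print(n, containment_estimate(ref, sample, n))
print("true", true_c)
```

With `n` equal to the number of distinct reference k-mers the estimate equals the true containment; smaller `n` trades accuracy for resources, which is the trade-off described above.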

luizirber · 2020-04-06T20:41:01Z

1. Please do note that it is indeed _expected_ that when using `StreamingQueryDNADatabase.py <snip> bot-top-diff`, then the containment index estimates for k-mer sizes `bot <= k < top` will indeed be off (sometimes significantly so) while k-mer size `k=top` will be at least as accurate as the [manuscript](https://doi.org/10.1016/j.amc.2019.02.018) bloom-filter approach.

I'll also run for "cmash paper" (since I only have analysis with the new streaming cmash), and see how they compare.

2. Also, keep in mind that while `Sourmash` is using scaled hashes (iirc), `StreamingQueryDNADatabase.py` results will _only_ be as accurate as the number of hashes you chose via `-n` in `MakeStreamingDNADatabase.py`. The default value of `n=500` is quite small, and I would recommend _always_ using as large of an `n` as you can tolerate (resource-wise). Eg. `Metalign` uses `-n 1000` or `-n 2000`. As indicated in [equation (2.7)](https://www.biorxiv.org/content/10.1101/184150v1.full.pdf) (which is a lower bound on the streaming, multi-kmer size approach when `k-mer size = top`) the containment index accuracy exponentially improves with increasing `n` (note that equation uses `k` instead of `n`, but it's still up there in the exponent).

I did use -n 1000 and -n 100000 (to match the "high-accuracy" in this section of the mash screen paper). Interestingly, when the true containment is very small (0.01 to 0.2) CMash is calculating it correctly, but when it's close to 1 it is underestimating quite a bit (worst case is 0.51, when it should be 0.995).

luizirber · 2020-04-07T00:44:19Z

> I'll also run for "cmash paper" (since I only have analysis with the new streaming cmash), and see how they compare.

Done, and...

> I did use -n 1000 and -n 100000 (to match the "high-accuracy" in this section of the mash screen paper). Interestingly, when the true containment is very small (0.01 to 0.2) CMash is calculating it correctly, but when it's close to 1 it is underestimating quite a bit (worst case is 0.51, when it should be 0.995).

For "cmash paper" results are <1% from ground truth (when n=100000), and <5% for n=1000. I'm going to use these numbers for my comparisons.

dkoslicki · 2020-04-07T04:15:21Z

@luizirber I would be interested to see when non-“CMash paper” is underestimating badly (as in, the worst case of 0.51 when it should be 0.995). Maybe you could send me the training/testing data? Might be a usage issue (due to the constant flux of the streaming version of CMash), or could be some other bug or issue you’ve identified (since at least theoretically, high containment index is when this approach should work best).

luizirber · 2020-04-07T23:00:57Z

> I would be interested to see when non-“CMash paper” is underestimating badly (as in, the worst case of 0.51 when it should be 0.995). Maybe you could send me the training/testing data?

This is all coming from my thesis repo, here's a table with the containments:
https://nbviewer.jupyter.org/github/luizirber/phd/blob/491bc7b/experiments/smol_gather/notebooks/analysis.ipynb#CMash-with-TST-(new-version)
and for downloading the data and generating the results you can do

$ git clone https://github.com/luizirber/phd && cd phd
$ conda env create --force --file environment.yml
$ conda activate thesis
$ cd experiments/smol_gather && snakemake --use-conda

That last command might take some time to run, so if you want only the cmash results for k=(21,31) and n=(1000,1000000) you can do

`snakemake --use-conda outputs/cmash/SRR606249.csv outputs/cmash/SRR606249-k{21,31}-n{1000,1000000}.csv outputs/cmash_paper/SRR606249-k{21,31}-n{1000,1000000}.csv`

or, if you want to see all the commands that would be executed without using snakemake to run them, you can also use the `-np` flags (`-n` for dry run, `-p` for printing commands).
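For readers unfamiliar with shell brace expansion, the `k{21,31}-n{1000,1000000}` patterns in that command expand to one target per (k, n) combination. A small Python sketch that lists the same targets (the variable names are only illustrative):

```python
from itertools import product

ks = (21, 31)
ns = (1000, 1000000)

# The single un-parameterised target, followed by one target per (k, n) pair
# for each of the two output directories named in the command.
targets = ["outputs/cmash/SRR606249.csv"]
for prefix in ("outputs/cmash", "outputs/cmash_paper"):
    targets += [f"{prefix}/SRR606249-k{k}-n{n}.csv" for k, n in product(ks, ns)]

for target in targets:
    print(target)
```

That is nine output files in total: one plain CSV plus four per directory.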

(I was also running for k=51 for new CMash, but the paper version fails during NodeGraph construction, which is a bit embarrassing for me since it's using khmer 😞 )

- dkoslicki added the enhancement and good first issue labels · Feb 6, 2020
- dkoslicki assigned IsaacT1123 · Feb 7, 2020
- IsaacT1123 added a commit that referenced this issue · Mar 25, 2020 · 80a09f4 · create Intersect class & kmer size retreival from DB #15
- IsaacT1123 referenced this issue · Mar 27, 2020 · 4dd616b · transfer rest of script funcs into Intersect class & generalize pathing
- IsaacT1123 added a commit that referenced this issue · Mar 31, 2020 · 0566c3c · add kmer intersection option #15
- IsaacT1123 referenced this issue · Mar 31, 2020 · d7e197e · use absolute pathing for intersection input files
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 9e78b3a · PEP8 fixes, check for existance of file paths, use absolute instead of relative paths on Class __init__ #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · aeff6d6 · add test to run_small_tests for issue #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 7fb1d25 · reference db_kmers_loc instead of reconstructing it, change from Popen to run. #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · c788929 · switch to run from Popen, suppress output, check if command ran successfully #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 415160d · also add verbosity when doing the counting #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · db86852 · add checking of input_types #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 7f70252 · bring input kmers count file to init so it doesn't need to be reconstructed later on #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 94a4fc8 · more conversion to run from Popen, with included verbosity and error checking #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · a5ae4e0 · add comment about the -fm flag if the input is in fastq format #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · d8220d8 · move the file references in intersect up to init for easier future modification #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 9f6a00c · more Popen to run, and verbosity control. #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 65990d9 · add a bit more error handling #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 8ec7889 · missed anoter Popen replace with run #15
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 1af1b2d · attempting to debug issues with KMC stuff for #15, so creating a new branch since a lot will get messed up.
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · ac6e6b2 · modified the small tests to show the issue in #15 with kmc_dump
- dkoslicki added a commit that referenced this issue · Mar 31, 2020 · 3dcd8a2 · add -ci0 to the intersect for #15. still no dice
- dkoslicki added a commit that referenced this issue · Apr 1, 2020 · dd86529 · found that the first KMC issue is in counting the database k-mers. See replicate_in_python.py after the FIXME for #15
- dkoslicki added a commit that referenced this issue · Apr 1, 2020 · 51796be · first issue found: training k-mers were dumped in multi-fasta format, but kmc was called with -fa instead of -fm. This at least dumps more k-mers (544 instead of 3). #15 will now test if this is an accurate dump
- dkoslicki added a commit that referenced this issue · Apr 1, 2020 · c31b3cc · checked and KMC and python now agree on the dumped database k-mers. #15
- dkoslicki added a commit that referenced this issue · Apr 1, 2020 · a82082e · agrees on read k-mers, problem with kmc_dump atm. See line 155 in replicate_in_python.py #15
- dkoslicki added a commit that referenced this issue · Apr 2, 2020 · 62a0d71 · everything seems to agree with the python replication of Query.Intersection(), so appears to be a problem further down the line #15
- dkoslicki added a commit that referenced this issue · Apr 2, 2020 · 6e26050 · finally figured out what's going on. See comments in #15 for description
- dkoslicki added a commit that referenced this issue · Apr 2, 2020 · 0968601 · small bug fix and add note about streaming k-mers through instead of reads through. #15
- dkoslicki added a commit that referenced this issue · Apr 6, 2020 · abc4f9c · add a couple of helpful print lines #15
- IsaacT1123 added a commit that referenced this issue · Apr 24, 2020 · 7fc2e86 · Merge branch 'master' into kmc_issue15 #15
- IsaacT1123 added a commit that referenced this issue · Apr 24, 2020 · 742a796 · Check that kmer size range is singular and agrees with training DB kmer size before computing intersection #15
