-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved classification time with KMC #15
Comments
Could be nice to write this as an optional script that will be run before the |
I’d be happy to work on this enhancement after getting CMash installed and creating a retraining script. |
Excellent! Go ahead and make a new branch (named something like |
@IsaacT1123 a friendly reminder to reference this issue in your commits! It's a great way to keep track of changes being made, and is also a visible history of your contributions to FOSS projects! |
I’ll be sure to do so from now on- thank you for reminding me |
…f relative paths on Class __init__ #15
Still in the process of diagnosing the issue. Will need to return to it later @dkoslicki |
…branch since a lot will get messed up.
Definitely an issue with the /usr/bin/time python ${scriptsDir}/MakeStreamingDNADatabase.py filenames.txt TrainingDatabase.h5 -k 10
/usr/bin/time python ${scriptsDir}/StreamingQueryDNADatabase.py ${testOrganism} TrainingDatabase.h5 results.csv 10-10-1 --sensitive --intersect -c 0 and then checking: grep -c '>' 10mers_intersection_dump.fa you get 3 10-mers, which is totally not accurate since the |
Indeed, it might be the
gives 689. I.e. if you dump the 10mers from the reads, dump the training database 10-mers, and count the ones in common, you get much more than 3... |
…e replicate_in_python.py after the FIXME for #15
… but kmc was called with -fa instead of -fm. This at least dumps more k-mers (544 instead of 3). #15 will now test if this is an accurate dump
@IsaacT1123 So the |
…ection(), so appears to be a problem further down the line #15
@IsaacT1123 Finally figured out the issue, and it's "obvious" now that I see what happens: When intersecting the reads 21-mers with the training database 21-mers, you miss out on some k-mers with k<21 that are in the reads (those that aren't prefixes of some database 21-mer). Eg. the query file has the 11-mer "TGCCCTGTGGC" in it, but since this isn't a prefix of a 21-mer in the training database, it isn't a prefix of any 21-mer included in the I will need to think about if the KMC approach will work for multiple k-mer sizes. Thankfully, applications like Metalign (that only use a single k-mer size), are unaffected by this issue. |
I was using both (even so, if you look at the table in this section there are some datasets with big differences to the ground truth...) |
Recall that last year when I presented this approach, I mentioned how when you take the prefix of a k-mer, it is not a truly random sample, but a biased sample. The theory has been worked out by how much this affects the containment index due to the bias, but unfortunately the theory says that the bias factor is data dependent. Hence why @ShaopengLiu1 is working on #20 to test this on realistic data sets. He too is seeing difference of roughly 10-30%, but it complete depends on what the actual containment index is (eg: very small or very high true containment index means k-mer sizes with
|
I'll also run for "cmash paper" (since I only have analysis with the new streaming cmash), and see how they compare.
I did use |
Done, and...
For "cmash paper" results are <1% from ground truth (when |
@luizirber I would be interested to see when non-“CMash paper” is underestimating badly (as in, the worst case of 0.51 when it should be 0.995). Maybe you could send me the training/testing data? Might be a usage issue (due to the constant flux of the streaming version of CMash), or could be some other bug or issue you’ve identified (since at least theoretically, high containment index is when this approach should work best). |
This is all coming from my thesis repo, here's a table with the containments: $ git clone https://github.com/luizirber/phd && cd phd
$ conda env create --force --file environment.yml
$ conda activate thesis
$ cd experiments/smol_gather && snakemake --use-conda That last command might take some time to run, so if you want only the (I was also running for |
…er size before computing intersection #15
When running
StreamingQueryDNADatabase.py
, in reality, we need only the K-mers in the sample that exist in the training database sketches. As such, it's possible to:StreamingQueryDNADatabase.py
Steps 2-5 is basically what's done here as I noted this approach to Nathan LaPierre, but never got around to implementing it in CMash yet.
The text was updated successfully, but these errors were encountered: