Is a nodegraph a bloom filter? #1898
Also see: czbiohub-sf/orpheum#8 (comment)
Daniel Standage replied (Nov 5, 2019):

Close. You are correct that `n_tables` is equivalent to k in the Wikipedia description. But `tablesize` is not equivalent to m in the Wikipedia definition. Rather, m is (approximately) the sum of the sizes of all the tables in the Nodegraph.

I typically visualize khmer's implementation of the Nodegraph/Nodetable/Countgraph/Counttable as a table with several rows, many columns, and a jagged right edge. Each row in this table is what khmer is referring to with `n_tables`, and `tablesize` is the target length of each row. I say "target" length here because the rows aren't all the same length: khmer selects the length of each row by starting at `tablesize` and finding the next n prime numbers smaller than `tablesize`.

When it comes time to insert into or query the filter, the element is hashed n times, the hash function simply being division modulo the row length for each row in the table. Because each row length is a different prime number, the n hash functions are linearly independent (a property required for accuracy guarantees).

So rather than allocating a single bit vector (or byte vector for the Count-min sketch) m units long and implementing k linearly independent hash functions (as commonly described for Bloom filters), khmer allocates k bit/byte vectors, each roughly m/k units long, using division modulo the vector length as the hash function for each of the k vectors.
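Here's a minimal Python sketch of that layout, to make the description concrete. This is my own illustration, not khmer's actual code: `ToyNodegraph`, `primes_below`, and the use of Python's built-in `hash()` as the base hash are all inventions for the example.

```python
# Illustrative sketch only -- not khmer's implementation. It mimics
# the layout described above: n_tables separate bit vectors ("rows")
# whose lengths are the first n_tables primes below tablesize, with
# the i-th "hash function" being one base hash reduced modulo the
# i-th row's prime length.

def primes_below(x, n):
    """Return the first n primes strictly below x, counting downward."""
    def is_prime(p):
        if p < 2:
            return False
        if p % 2 == 0:
            return p == 2
        return all(p % d for d in range(3, int(p ** 0.5) + 1, 2))

    primes = []
    candidate = x - 1
    while len(primes) < n and candidate >= 2:
        if is_prime(candidate):
            primes.append(candidate)
        candidate -= 1
    return primes

class ToyNodegraph:
    """A Bloom filter laid out as n_tables bit vectors with distinct
    prime lengths, rather than one m-bit vector with k independent
    hash functions."""

    def __init__(self, tablesize, n_tables):
        self.table_sizes = primes_below(tablesize, n_tables)
        self.tables = [bytearray((size + 7) // 8) for size in self.table_sizes]

    def _positions(self, element):
        # One base hash; reducing it modulo each distinct prime row
        # length acts as n_tables different hash functions.
        base = hash(element)
        return [base % size for size in self.table_sizes]

    def add(self, element):
        for table, pos in zip(self.tables, self._positions(element)):
            table[pos // 8] |= 1 << (pos % 8)

    def get(self, element):
        # May return a false positive, but never a false negative.
        return all(
            (table[pos // 8] >> (pos % 8)) & 1
            for table, pos in zip(self.tables, self._positions(element))
        )

ng = ToyNodegraph(tablesize=1000, n_tables=4)
ng.add("GATTACA")
assert ng.get("GATTACA")
print(ng.table_sizes)  # [997, 991, 983, 977] -- four primes below 1000
```

Using distinct primes is what lets a single base hash stand in for k hash functions: the residues modulo coprime moduli vary independently of one another.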
Ah, okay got it!!! That makes a lot of sense. This helps a LOT with calculating expected false positives.

Warmest,
Olga
---
Olga Botvinnik, PhD
olgabotvinnik.com
Hello! Seems like I should have figured this out by now, but somehow I haven't.

I'm calculating the expected false positive rates for various table sizes and numbers of hash functions. Am I understanding correctly that for Nodegraph, the `n_tables` is the same as the number of hash functions, i.e. k in the Wikipedia article? And is the `tablesize` the same as m from Wikipedia?

Here is some code to calculate the false positive probability that all k-mers from a read were false positives:
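(The snippet itself didn't survive in this copy of the thread. What follows is a hedged reconstruction of that calculation, assuming the mapping discussed above, k = `n_tables` and m ≈ `n_tables` × `tablesize`; the function names are mine, not from the original issue.)

```python
import math

def kmer_fpr(num_unique_kmers, tablesize, n_tables):
    """Expected per-query false positive rate, treating the Nodegraph
    as a Bloom filter with k = n_tables hash functions and roughly
    m = n_tables * tablesize total bits."""
    m = n_tables * tablesize
    k = n_tables
    return (1.0 - math.exp(-k * num_unique_kmers / m)) ** k

def read_fpr(num_unique_kmers, tablesize, n_tables, read_length, ksize):
    """Probability that *every* k-mer in a read is a false positive,
    assuming the per-k-mer queries are independent."""
    num_kmers = read_length - ksize + 1
    return kmer_fpr(num_unique_kmers, tablesize, n_tables) ** num_kmers

# Sanity check: 1e6 unique k-mers, tablesize=1e6, n_tables=4 gives
# (1 - e**-1)**4 ~= 0.16 per k-mer.
```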
And here's some example error rates:
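(The original table of rates is also missing from this copy. As a stand-in, the sketch above can generate illustrative numbers; these come from the formula, not from the issue.)

```python
# Illustrative rates for 1e6 unique k-mers and 100 bp reads, ksize=21.
for tablesize in (1e6, 1e8):
    for n_tables in (2, 4):
        p = kmer_fpr(1e6, tablesize, n_tables)
        q = read_fpr(1e6, tablesize, n_tables, read_length=100, ksize=21)
        print(f"tablesize={tablesize:.0e}  n_tables={n_tables}:  "
              f"per-kmer FPR={p:.3g}  whole-read FPR={q:.3g}")
```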