Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is a nodegraph a bloom filter? #1898

Open
olgabot opened this issue Nov 5, 2019 · 3 comments
Open

Is a nodegraph a bloom filter? #1898

olgabot opened this issue Nov 5, 2019 · 3 comments

Comments

@olgabot
Copy link
Contributor

olgabot commented Nov 5, 2019

Hello! Seems like I should have figured this out by now but somehow I haven't.

I'm calculating the expected false positive rates for various table sizes and numbers of hash functions. Am I understanding correctly that for Nodegraph, the n_tables is the same as the number of hash functions, i.e. k in the wikpedia? And is the tablesize the same as m from wikipedia?

Here is some code to calculate the false positive probability that all k-mers from a read were false positives:

def per_read_false_positive_coding_rate(n_kmers_in_read, n_total_kmers=1e7,
                                        n_hash_functions=DEFAULT_N_TABLES,
                                        tablesize=DEFAULT_MAX_TABLESIZE):
    exponent = - n_hash_functions * n_total_kmers / tablesize
    print(f"exponent: {exponent}")

    # Probability that a single k-mer is randomly in the data
    # per_kmer_fpr = math.pow(1 - math.exp(exponent), n_hash_functions)

    # Use built-in `exp1m` = exp - 1
    # - (exp - 1) = 1 - exp
    per_kmer_fpr = math.pow(- math.expm1(exponent), n_hash_functions)
    print(f"per kmer false positive rate: {per_kmer_fpr}")

    # Probability that the number of k-mers in the read are all false positives
    per_read_fpr = math.pow(per_kmer_fpr, n_kmers_in_read)
    return per_read_fpr

And here's some example error rates:

per_read_false_positive_coding_rate(30, 1e7, 4, 1e7) 
exponent: -4.0
per kmer false positive rate: 0.9287257558982396
Out[88]: 0.10879894741447617
per_read_false_positive_coding_rate(30, 1e7, 2, 1e7) 
exponent: -2.0
per kmer false positive rate: 0.7476450724155088
Out[89]: 0.00016250407621135139
@olgabot
Copy link
Contributor Author

olgabot commented Nov 5, 2019

Also see: czbiohub-sf/orpheum#8 (comment)

@standage
Copy link
Member

standage commented Nov 6, 2019

Am I understanding correctly that for Nodegraph, the n_tables is the same as the number of hash functions, i.e. k in the wikpedia? And is the tablesize the same as m from wikipedia?

Close. You are correct that n_tables is equivalent to k in the wikipedia description. But tablesize is not equivalent to m in the Wikipedia definition. Rather, m is the sum (approximately) of all the tables in the Nodegraph.

I typically visualize khmer's implementation of the Nodegraph/Nodetable/Countgraph/Counttable as a table with several rows, many columns, and a jagged right edge. Each row in this table is what khmer is referring to with n_tables, and tablesize is the target length of each row. I say "target" length here, because the rows aren't the same length. khmer selects the length of each row by starting at tablesize and finding the next n prime numbers smaller than tablesize.

When it comes time to inserting or querying the filter, the element is hashed n times, the hash function simply being division modulo the row length for each row in the table. Because each row length is a different prime number, the n hash functions are linearly independent (a property required for accuracy guarantees).

So rather than allocating a single bit vector (or byte vector for the Count-min sketch) m units long and implementing k linearly independent hash functions (as commonly described for Bloom filters), khmer allocates k bit/byte vectors m/k units long each, using division modulo the vector length as the hash function for each of the k vectors.

@olgabot
Copy link
Contributor Author

olgabot commented Nov 6, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants