Benchmarking Learned Bloom Filters

About ℹ️

Motivation

This project was created as part of a Bachelor project at the IT University of Copenhagen in Spring 2023. It aims to provide an open and transparent way to benchmark various Learned Bloom Filters. The implementations of the benchmarked Bloom Filters are located in /bloom_filters.

Data sets

Two different data sets are provided.

URL data set

A data set of 450,176 labeled URLs: 345,738 benign and 104,438 malicious. The data set is provided in /data/raw/url_data (source).

To vectorize the URL data run:

make vectorize
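To give a feel for what vectorizing a URL can mean, here is a minimal sketch that maps a URL to a short numeric feature vector. It is not the project's vectorizer: the function name `url_features` and the chosen features are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit


def url_features(url: str) -> list[float]:
    """Map a URL to a small fixed-length numeric vector (illustrative features only)."""
    # urlsplit only fills in the host when a scheme separator is present,
    # so a default scheme is prepended for bare URLs such as "example.com/a".
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc
    return [
        float(len(url)),                       # total length of the URL as given
        float(len(host)),                      # length of the host part
        float(host.count(".")),                # number of dots in the host
        float(sum(c.isdigit() for c in url)),  # number of digits
        float(url.count("-")),                 # number of hyphens
        float(len(parts.path)),                # length of the path
        float(len(parts.query) > 0),           # 1.0 if a query string is present
    ]


if __name__ == "__main__":
    print(url_features("http://example.com/a?b=1"))
```

A classifier can then be trained on such vectors together with the benign/malicious labels.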

Synthetic Zipfian data set

A synthetic Zipfian data set is provided. It can be regenerated with:

make zipf
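As a rough sketch of how Zipfian data can be generated, the snippet below draws ranks with probability proportional to 1 / rank^s using only the standard library. The function names, the exponent and the seed are assumptions for illustration; the project's own generator may work differently.

```python
import random
from collections import Counter


def zipf_weights(n_items: int, exponent: float = 1.0) -> list[float]:
    """Return normalized Zipfian probabilities for ranks 1..n_items."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_zipf(n_items: int, n_samples: int,
                exponent: float = 1.0, seed: int = 0) -> list[int]:
    """Draw n_samples ranks (1 = most frequent) from a Zipfian distribution."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    weights = zipf_weights(n_items, exponent)
    return rng.choices(range(1, n_items + 1), weights=weights, k=n_samples)


if __name__ == "__main__":
    samples = sample_zipf(n_items=1000, n_samples=10_000)
    # Low ranks should dominate the sample.
    print(Counter(samples).most_common(5))
```

A larger exponent concentrates more of the probability mass on the first few ranks.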

Installation ⚙️

1. Clone the repository

git clone git@github.com:BSc-learned-indexes/daisy-bf.git
cd daisy-bf

2. Recommended: Creating a virtual environment

We recommend installing this project's dependencies in an isolated environment. If you are unfamiliar with this concept, you can read more about it here.

Create the environment

python -m venv ~/.virtualenvs/daisy

Source the environment

source ~/.virtualenvs/daisy/bin/activate
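To confirm that the environment is active before installing anything, a small check like the following can help. It relies on the standard library only; the helper name `in_virtualenv` is an assumption for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```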

3. Installing dependencies

pip install -r requirements.txt

Usage 📈

Benchmarking the Bloom Filters

We have provided a template to run a benchmarking experiment with the following settings:

Large Random Forest Classifier as model
1 - px as the query distribution
URL data set
Full key set
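To make concrete what such a benchmark measures, the sketch below builds a standard (non-learned) Bloom filter and estimates its false positive rate on a set of non-keys. It is a simplified illustration, not the project's implementation: the class `BloomFilter`, the helper `empirical_fpr`, the double-hashing scheme and the synthetic keys are all assumptions, and neither the learned model nor the query distribution is modeled.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """A standard Bloom filter over strings using double hashing."""

    def __init__(self, n_bits: int, n_hashes: int) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # "| 1" keeps the step non-zero
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def empirical_fpr(bf: BloomFilter, non_keys: list[str]) -> float:
    """Fraction of non-keys that the filter wrongly reports as present."""
    hits = sum(1 for x in non_keys if x in bf)
    return hits / len(non_keys)


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(1000)]
    non_keys = [f"other-{i}" for i in range(10_000)]
    bits_per_key = 10
    # The usual choice k = (m / n) * ln 2 minimizes the expected FPR.
    n_hashes = round(bits_per_key * math.log(2))
    bf = BloomFilter(bits_per_key * len(keys), n_hashes)
    for k in keys:
        bf.add(k)
    print("empirical FPR:", empirical_fpr(bf, non_keys))
```

Repeating the measurement for several values of `bits_per_key` gives a size versus false positive rate curve for this baseline filter.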

A series of make commands are provided to build the filters:

Build Adaptive Learned Bloom Filter

make adabf

Build Partitioned Learned Bloom Filter

make plbf

Build Daisy Bloom Filter

make daisy

Build all Bloom Filters

Note: this command takes a while 🐌

make all

Plot all Bloom Filters

make plot_all

Plot all the Learned Bloom Filters (excludes the regular Bloom Filter)

make plot_learned_bf

Example output 🖼️

Example: 0.1% key to non-key ratio, query distribution: qx = 1 - px, model: Large Random Forest Classifier

Extra 🤓

The directory /experiments contains all the data presented in the Experiments section of the Bachelor's thesis.
The thesis can be read in /thesis.
