This project was created as part of a Bachelor project at the IT University of Copenhagen in Spring 2023. It aims to provide an open and transparent way to benchmark various Learned Bloom Filters. It contains implementations of the following Bloom Filters in /bloom_filters:
- Regular Bloom Filter
- Partitioned Learned Bloom Filter
- Adaptive Learned Bloom Filter
- Daisy Bloom Filter
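For readers new to the baseline, the following is a minimal sketch of a regular Bloom filter. It is not the code in /bloom_filters: the class name `BloomFilter`, the use of SHA-256 with double hashing, and the example URLs are all assumptions made for illustration. Only the sizing formulas are the standard textbook ones.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter over strings (hypothetical, not the project's class)."""

    def __init__(self, n_items: int, fpr: float) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.m = max(1, math.ceil(-n_items * math.log(fpr) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str) -> list[int]:
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Made-up keys and non-keys, used only to exercise the sketch.
keys = [f"https://example.com/page/{i}" for i in range(1000)]
bf = BloomFilter(n_items=len(keys), fpr=0.01)
for key in keys:
    bf.add(key)

non_keys = [f"https://example.org/other/{i}" for i in range(10000)]
false_positives = sum(key in bf for key in non_keys)
observed_fpr = false_positives / len(non_keys)
```

A Bloom filter never reports a false negative, so every inserted key is found, while `observed_fpr` should land near the requested target. The learned variants listed above put a classifier in front of one or more such filters to reduce the space needed for the same error rate.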
Two different data sets are provided.
A data set containing 450,176 labeled URLs: 345,738 benign and 104,438 malicious. The data set is provided in /data/raw/url_data (source).
To vectorize the URL data run:
make vectorize
A synthetic Zipfian data set is also provided. It can be regenerated with:
make zipf
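To give an idea of what a Zipfian data set looks like, here is a small sketch that samples ranks with probability proportional to 1 / rank^s. It is not the generator behind the make target: the function names `zipf_weights` and `sample_zipf`, the exponent `s`, and the sizes used are assumptions chosen for illustration.

```python
import random
from collections import Counter


def zipf_weights(n: int, s: float = 1.0) -> list[float]:
    """Normalized Zipf probabilities for ranks 1..n with exponent s."""
    raw = [1.0 / rank ** s for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_zipf(n: int, size: int, s: float = 1.0, seed: int = 0) -> list[int]:
    """Draw `size` ranks from a Zipf distribution over 1..n (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.choices(range(1, n + 1), weights=zipf_weights(n, s), k=size)


# Hypothetical sizes, used only to show the skew.
samples = sample_zipf(n=1000, size=50000, s=1.0, seed=42)
counts = Counter(samples)
```

With these settings, low ranks dominate the sample: `counts[1]` is far larger than `counts[10]`, which is the skew that makes Zipfian data a useful stress test for filters that exploit key or query frequencies.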
git clone [email protected]:BSc-learned-indexes/daisy-bf.git
cd daisy-bf
We recommend that you install this project's dependencies in an isolated environment. If you are unfamiliar with this concept, you can read more about it here.
python -m venv ~/.virtualenvs/daisy
source ~/.virtualenvs/daisy/bin/activate
pip install -r requirements.txt
We have provided a template to run a benchmarking experiment with the following settings:
- Large Random Forest Classifier as model
- qx = 1 - px as the query distribution
- URL data set
- Full key set
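The query distribution setting can be illustrated with a short sketch. It assumes that px is a per-element score in [0, 1] (for example, a model's estimate that x is a key) and that the weights 1 - px are normalized to sum to 1. Both points are assumptions, as are the function name `query_distribution` and the example values; the template in this repository may define these differently.

```python
def query_distribution(p: dict[str, float]) -> dict[str, float]:
    """Turn per-element scores p_x into query weights q_x proportional to 1 - p_x."""
    raw = {x: 1.0 - px for x, px in p.items()}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("all p_x are 1, so 1 - p_x cannot be normalized")
    # Normalizing is an assumption made here so that the weights form a distribution.
    return {x: w / total for x, w in raw.items()}


# Made-up scores for three hypothetical elements.
p = {"a.example": 0.9, "b.example": 0.5, "c.example": 0.1}
q = query_distribution(p)
```

Under this reading, elements that are unlikely to be keys are queried most often, so the benchmark puts the most weight on exactly the queries where a false positive can occur.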
A series of make commands is provided to build the filters:
make adabf
make plbf
make daisy
Note: this command takes a while 🐌
make all
make plot_all
make plot_learned_bf
Example: 0.1% key to non-key ratio, query distribution: qx = 1 - px, model: Large Random Forest Classifier
- The directory /experiments contains all the data presented in the Experiments section of the Bachelor's thesis.
- The thesis can be read in /thesis.