This repository contains the code to reproduce the experiments in the ALENEX21 paper:
Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. A "learned" approach to quicken and compress rank/select dictionaries. In Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX). SIAM, 2021.
In brief, the paper follows a recent line of research on so-called learned data structures and provides the first “learned” scheme for implementing a rank/select dictionary in compressed space. In particular, it introduces a novel lossless compressed storage scheme for the input dictionary S that turns this problem into the one of approximating a set of points in the Cartesian plane via segments, so that S can be encoded by a compressed representation of these segments and of the “errors” they make in approximating the input integers. Proper algorithms and data structures are then added to this compressed storage scheme to support fast rank and select operations.
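To make the idea concrete, here is a minimal C++ sketch of the scheme described above (a simplified illustration, not the code in this repository): each segment stores a slope and an intercept, and select(i) is answered by evaluating the segment covering position i and adding the small stored correction.

```cpp
// Minimal sketch of "segments + corrections", not the repository's actual code.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Segment {
    std::size_t first;   // index of the first element covered by this segment
    double slope;
    double intercept;
};

// Returns S[i] given the segments and the per-element corrections.
std::int64_t select(const std::vector<Segment> &segments,
                    const std::vector<std::int32_t> &corrections,
                    std::size_t i) {
    // Find the segment covering index i (linear scan for simplicity;
    // a real implementation would use a faster search).
    std::size_t j = 0;
    while (j + 1 < segments.size() && segments[j + 1].first <= i)
        ++j;
    const Segment &s = segments[j];
    auto approx = std::llround(s.slope * (i - s.first) + s.intercept);
    return approx + corrections[i];  // |corrections[i]| is bounded by the chosen error
}
```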
To run the experiments you need CMake 3.8+, a compiler with support for C++17, Boost, and OpenMP. Please clone this repository using the flag --recurse-submodules, so that the submodules are fetched as well.
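For example (the URL below is just a placeholder for this repository's URL):
git clone --recurse-submodules <repository-url>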
To compile the executables, issue the following commands:
mkdir build && cmake . -B build -DCMAKE_BUILD_TYPE=Release
cd build && make && cd ..
cp build/Comparison ./Comparison
cp build/Generate_datasets ./Generate_datasets
The latter commands compile both the benchmark program (comparison.cpp) and the dataset-generation program (generate_datasets.cpp). To get the already-prepared datasets you can download them here and put them in the directory datasets.
The experiments can be run with the following script, which will populate a result directory with a CSV file:
bash run_all.sh
The experiments require at least 32 GB of RAM and may take quite some time to finish (approximately 9 hours on our machine, whose specs are detailed below).
Since this benchmark deals with very different data structure implementations that expose slightly different operations, a suite of tests checks the correctness of every implementation and every wrapper. The tests are in the directory tests. To run them:
tests/my_tests
The datasets are generated in the following way:
Download the full GOV2 Inverted Index here. Then run:
Generate_datasets write_integers_from_GOV2 <gov2_file>
The latter command creates the following directories:
- datasets/GOV2/100K-1M with all the inverted lists having between 100 thousand and 1 million elements.
- datasets/GOV2/1M-10M with all the inverted lists having between 1 million and 10 million elements.
- datasets/GOV2/10M- with all the inverted lists having more than 10 million elements.
All these files containing inverted lists are ready for the benchmark.
Download the dataset of URLs from the URL Classification task available here and extract the second column:
awk -F "\"*,\"*" '{print $2}' URL_Classification.csv > URL_Classification_justURL.csv
Then download the datasets of news URLs available here and extract the third column of both files:
awk -F "\"*,\"*" '{print $3}' news-week-18aug24.csv > news-week_justURL.csv
awk -F "\"*,\"*" '{print $3}' news-week-17aug24.csv >> news-week_justURL.csv
Then download the dataset of DOI URLs available here and extract the second column:
awk -F "\"*,\"*" '{print $2}' 2013.csv > 2013_just_URL.csv
Merge and sort all the URLs:
cat URL_Classification_justURL.csv 2013_just_URL.csv news-week_justURL.csv >> my_url_all.txt
sort -o my_url_all.txt my_url_all.txt
Compile the Burrows–Wheeler transform tool available here and run:
bwte my_url_all.txt -m 8162
The latter command creates the file my_url_all.txt.bwt. Finally, extract the positions of the most common character using:
Generate_datasets write_integers_from_BWT my_url_all.txt.bwt
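For intuition, the following hypothetical C++ sketch shows what this step conceptually computes: it finds the most frequent byte in the BWT file and prints the positions where it occurs. The actual Generate_datasets tool and its output format may differ.

```cpp
// Hypothetical illustration of extracting the positions of the most common
// character from a BWT file; not the repository's actual implementation.
#include <array>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main(int argc, char **argv) {
    if (argc != 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());

    std::array<std::size_t, 256> freq{};        // frequency of each byte value
    for (unsigned char c : data)
        ++freq[c];

    std::size_t most_common = 0;
    for (std::size_t c = 1; c < 256; ++c)       // pick the most frequent byte
        if (freq[c] > freq[most_common]) most_common = c;

    for (std::size_t i = 0; i < data.size(); ++i)
        if (data[i] == most_common)             // print its occurrence positions
            std::printf("%zu\n", i);
}
```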
We use more than 100 GB of 5-grams from the Google corpus Version 20090715. The exact list of the files we used is in this file. To aggregate all the 5-grams, run:
Generate_datasets write_google_ngram_v1 googlebooks-eng-all-5gram-20090715-xxx.csv ... googlebooks-eng-all-5gram-20090715-yyy.csv
This creates a file called ALL_google-eng-all-5gram.txt that contains all the distinct 5-grams in the input files.
Compile the Burrows–Wheeler transform tool available here and run:
bwte ALL_google-eng-all-5gram.txt -m 8162
The latter command creates the file ALL_google-eng-all-5gram.txt.bwt. Finally, extract the positions of the most common character using:
Generate_datasets write_integers_from_BWT ALL_google-eng-all-5gram.txt.bwt
Download the full human genome from here (file type Genomic FASTA, .fna) and extract it. Then produce a version of the genome that contains only the characters A, C, G, and T:
grep -v '^>' GCF_000001405.39_GRCh38.p13_genomic.fna > genomemic_tmp.unmasked
awk '{if(/^[^>]/)$0=toupper($0);print $0}' genomemic_tmp.unmasked > tmp.txt && mv tmp.txt genomemic_tmp.unmasked
tr -d 'N' < genomemic_tmp.unmasked > tmp.txt && mv tmp.txt genomemic_tmp.unmasked
tr -d 'B' < genomemic_tmp.unmasked > tmp.txt && mv tmp.txt genomemic_tmp.unmasked
tr -d 'M' < genomemic_tmp.unmasked > tmp.txt && mv tmp.txt genomemic_tmp.unmasked
tr -d '\n' < genomemic_tmp.unmasked > tmp.txt && mv tmp.txt genomemic.unmasked
rm genomemic_tmp.unmasked
The latter commands produce the file genomemic.unmasked of about 3.2 GB. Truncate it to the first 1 GB:
truncate -s 1000000000 genomemic.unmasked
Finally, to extract the positions of all the nucleotides, run:
Generate_datasets write_integers_from_genome genomemic.unmasked
The output files can be analysed using the following Jupyter notebooks:
- plots/plot_bar.ipynb generates Figure 3 of the paper. The output is the file plot_bars.tex.
- plots/plot_space_time.ipynb generates Figures 4 and 5 of the paper. The outputs are the files plot_rank.tex and plot_select.tex.
The code was tested on the following machine:
Component | Specs |
---|---|
CPU | Intel(R) Xeon(R) CPU E5-2407 v2 @ 2.40GHz |
RAM | 41 GB |
L1 cache | 32 KB (data) 32 KB (instructions) |
L2 cache | 256 KB |
L3 cache | 1 MB |
OS | Ubuntu 16.04.6 LTS |
Compiler | gcc 9.2.1 |
CMake | version 3.13.2 |
This project is licensed under the terms of the GNU General Public License v3.0.
If you use this code for your research, please cite:
@inproceedings{Boffa:2021,
Author = {Boffa, Antonio and Ferragina, Paolo and Vinciguerra, Giorgio},
Booktitle = {Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX)},
Publisher = {{SIAM}},
Title = {A ``learned'' approach to quicken and compress rank/select dictionaries},
Year = {2021}}