`cargo-djanco` is a helper tool that generates a harness for executing
queries. The tool provides a cargo command (`djanco`). Executing it analyzes
the current crate looking for query functions and generates a Rust source
file that sets the queries up and executes each of them.
Install from within your cargo crate:

```
cargo install --git https://github.com/PRL-PRG/cargo-djanco
```

Generate the harness:

```
cargo djanco
```

Execute your queries:

```
cargo run --bin djanco --release -- \
    --dataset-path DATASET_LIVES_HERE \
    --output-path WRITE_RESULTS_HERE
```
`cargo-djanco` is used as a tool within an existing Rust crate containing
some queries. The crate should depend on the `djanco` library. Queries are
functions tagged with the `djanco` attribute. Example:
```rust
#[djanco(April, 2020, subsets(C, Python, SmallProjects))]
pub fn my_query(database: &Database, _log: &Log, output: &Path) -> Result<(), std::io::Error> {
    database.projects()                                   // iterate over projects in the database
        .group_by(project::Language)                      // group them by language
        .sort_by(project::Stars)                          // sort by star count
        .sample(Top(1000))                                // keep the top 1000
        .into_csv_in_dir(output, "top_1000_by_stars.csv") // write the result as CSV into the output directory
}
```
Queries are functions with any name that take three arguments and return a
`Result` type. They can be located in any module of the crate. The arguments
represent (a minimal skeleton is sketched after this list):

- a reference to a database (`&djanco::data::Database`) from which the query gets its data,
- a log (`djanco::log::Log`) into which the query writes runtime logs,
- an output path (`std::path::Path`) of a directory where the query should write its results.
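For reference, here is a minimal sketch of a function with this shape. It does nothing except return successfully; the import paths follow the type names listed above, while how the `#[djanco]` attribute itself is brought into scope is not covered here and is left out.

```rust
use std::path::Path;
use djanco::data::Database;
use djanco::log::Log;

// A do-nothing query: any name, the three arguments described above,
// and a Result return type. The tag follows the same form as the
// example earlier in this document.
#[djanco(April, 2020, subsets(Python))]
pub fn smallest_query(_database: &Database, _log: &Log, _output: &Path) -> Result<(), std::io::Error> {
    Ok(())
}
```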
The tag takes several arguments as well (one possible combination is sketched after this list):

- Two describe the savepoint timestamp at which the database is checked out: a month and a year, both optional. If they are not provided, the current month and/or year are used.
- `subset`/`subsets` describes which subsets of the datastore to use in the query. If not provided, all subsets are checked out.
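To illustrate the optional arguments, the sketch below tags a do-nothing query (same imports as the skeleton above) with only a subset, relying on the current month and year; whether this exact spelling is accepted is an assumption based on the description above.

```rust
// Savepoint omitted: the current month and year are used.
// Only the Python subset of the datastore is checked out.
#[djanco(subsets(Python))]
pub fn python_only_query(_database: &Database, _log: &Log, _output: &Path) -> Result<(), std::io::Error> {
    Ok(())
}
```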
This cargo command analyzes the crate, extracts the queries, and generates a harness:

```
cargo djanco
```

The harness is a single source file in `src/bin/djanco.rs`. It can be compiled
and executed. It contains boilerplate code for running the queries, measuring their
execution times, and archiving the crate and its results to a git repository.

Queries are grouped by their database parameters. For each group, code is emitted
that sets up the database and executes all queries that use that specific configuration:
```rust
// Check out the database at the April 2020 savepoint, restricted to the
// C, Python, and SmallProjects subsets, then run every query in this group.
let database = prepare_database!(1585699200 /* = April 2020 */, stores!(C, Python, SmallProjects));
execute_query!(database, my_crate::my_query1);
execute_query!(database, my_crate::my_query3);
execute_query!(database, my_crate::my_query2);
```
The order in which the harness executes queries is not predictable.
A query can be executed as part of multiple groups if it is tagged with multiple configurations, as sketched below.
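For example, stacking two tags on one function would place it in two groups; this is a sketch based on the description above, and it assumes the attribute can be repeated on a single function.

```rust
// Same imports as the earlier skeleton. Tagged twice, this query lands in two
// groups: one for the Python subset and one for the C subset, both at the
// April 2020 savepoint.
#[djanco(April, 2020, subsets(Python))]
#[djanco(April, 2020, subsets(C))]
pub fn my_multi_group_query(_database: &Database, _log: &Log, _output: &Path) -> Result<(), std::io::Error> {
    Ok(())
}
```

The harness would then emit two `prepare_database!` groups, each followed by an `execute_query!` call for this function.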
The whole crate can be archived to an archive repository. To do this, specify a repository when executing the harness:

```
cargo run --bin djanco --release -- \
    --dataset-path DATASET_LIVES_HERE \
    --output-path WRITE_RESULTS_HERE \
    --repository REPOSITORY_URL
```

This will check out a branch with the same name as the current crate and push the entirety of the current crate's code as a single commit. Then, after the queries are executed, the results directory will be pushed as a second commit.
It's not always wise to store the results of queries, since they can get quite big. This step can therefore be skipped:

```
cargo run --bin djanco --release -- \
    --dataset-path DATASET_LIVES_HERE \
    --output-path WRITE_RESULTS_HERE \
    --repository REPOSITORY_URL \
    --skip-results
```
Make sure the repository URL is an SSH URL if you want to use SSH authentication :)