Commit: Add a page with some notes on reproducibility (#624)
* Add draft without examples

* Expand information on the different levels
PGijsbers authored Jun 13, 2024
1 parent d5a90a8 commit 4eabf9b
Showing 2 changed files with 194 additions and 0 deletions.
193 changes: 193 additions & 0 deletions docs/using/reproducing.md
@@ -0,0 +1,193 @@
# Reproducing a Benchmark Evaluation

This guide distinguishes three levels of reproducibility, which we will informally call *loose*, *balanced*, and *strict*.
We recommend *loose* reproducibility if you simply want to recreate the experimental setup but don't care too much about exact versions.
The *balanced* steps are the generally recommended way to reproduce a specific versioned experiment, ignoring some details which _most likely_ do not affect the outcomes.
The *strict* steps provide additional details to pay attention to in order to recreate the exact setup as faithfully as possible.

!!! note

    We are working on making it easier for people to share all relevant information needed to reproduce an evaluation
    as faithfully as possible, in a way that is directly digestible by the AutoML benchmark tool itself. In the meantime,
    we believe that following the _balanced_ steps below produces qualitatively similar results.

## Loose Reproducibility
A loose reproduction means evaluating a specific framework on a specific task using a recent version of the AutoML benchmark.
It is often possible to evaluate the desired framework with the most recent version of the benchmark.
It might even be possible to evaluate the specific (older) version of the AutoML framework used in the work you want to reproduce.
Unless the framework received significant updates, this will generally produce results very similar to the original.
The advantage of this approach is that you can use the benchmark the way you are used to (as described in ["Getting Started"](../getting_started.md)),
since most work on the benchmark itself does not affect the evaluation results, while you still benefit from up-to-date features and documentation.
To improve the faithfulness of the reproduction, use docker mode to constrain the framework's resources more rigorously:

```commandline
python runbenchmark.py autogluon openml/s/271 1h8c_gp3 -m docker
```

!!! note "Example commands on this page may take a long time"

    The example commands provided on this page are designed to reproduce (parts of) benchmarks,
    which can incur a lot of compute time. For example, the command above will evaluate
    AutoGluon on the entire classification suite and will take roughly 700 hours to complete.
    Typically, large-scale benchmarks should be parallelized by splitting the command
    by task (`--task=`) and/or fold (`--fold=`), as shown below.
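
For instance, a single task and fold from the suite above can be evaluated separately (substitute `TASK_NAME` with a task name from the suite definition):

```commandline
python runbenchmark.py autogluon openml/s/271 1h8c_gp3 -m docker --task=TASK_NAME --fold=0
```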

## Balanced Reproducibility
A balanced reproduction aims to be faithful while skipping the small details that require a lot of work but are almost certainly not going to change the results.
We recommend that you simply use the most recent release with the same _minor_ version.
Generally speaking, patch-level releases only contain bugfixes or changes which do not affect the outcome of the experiments.
By using the latest release of the minor version, you may benefit from minor bugfixes and increased stability.
For the JMLR paper, which includes experiments on versions *2.1.0* through *2.1.7*, this means using only *2.1.7* for all experiments.
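
For example, one way to check which patch releases exist for a given minor version is to list the repository tags (after cloning the repository, as shown below):

```commandline
git tag --list "v2.1.*"
```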

Similarly, we do not set any random seeds. In the AutoML Benchmark, random seeds are mostly used by the AutoML frameworks themselves.
The 10-fold cross-validation splits are not determined by a random seed, but are instead fixed and provided by OpenML (or, alternatively, defined in a file).
By default, the AutoML benchmark provides a different random seed to the framework for each fold in the evaluation,
so the effect of any single random seed will not be large. As such, we expect to find similar results regardless of whether seeds are set.
As rerunning batches of jobs with different random seeds is currently not well supported, we recommend ignoring this aspect for the _balanced_ setup.

```commandline title="Setting up version 2.1.7 of the AutoML benchmark using the repository tags"
git clone https://github.com/openml/automlbenchmark
cd automlbenchmark
git checkout v2.1.7
pyenv shell 3.9 # (1)
python -m venv venv
source venv/bin/activate
python -m pip install -r requirements.txt
```

1.  Older versions of the AutoML benchmark may require different versions of Python.
    Here, we use [pyenv](https://github.com/pyenv/pyenv) to make sure we use Python 3.9, which is the required version for the v2.1.7 release.
    Using the wrong version of Python _may_ work, but in many cases will lead to problems.
    Version 2.0 requires Python 3.6.
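
If Python 3.9 is not yet available in your pyenv installation, you can typically install it first; any 3.9.x patch release should do, for example:

```commandline
pyenv install 3.9.19
```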


If you encounter problems with this setup, it is possible that the version you are trying to install requires different installation steps.
Please have a look at the version-specific documentation that is checked in with that release.
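
As a quick sanity check of the environment before launching any long experiments, you can run a small smoke test; a minimal sketch, assuming the `constantpredictor` baseline and the bundled `test` benchmark definition of that release:

```commandline
python runbenchmark.py constantpredictor test
```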


## Strict Reproducibility
Strict reproducibility means making a true best effort to recreate the exact setup of the original experiments.
However, even with everything accounted for, keep in mind that you are unlikely to get exactly the same results because of factors like:

- Using slightly different hardware (even AWS EC2 instances are heterogeneous).
- Uncontrolled randomness in AutoML frameworks, such as those that arise from race conditions in multiprocess applications.
- Hardware errors and limitations, such as numerical precision errors.

That said, to reproduce the setup as closely as possible, you need access to the configuration files that were used and to the results file.
The results file specifies, for each job, which AMLB version, framework version, random seed, and so on were used.
With the same configuration files and the same AMLB version installed, you can then run a command like:

```commandline
python runbenchmark.py FRAMEWORK:VERSION STUDY CONSTRAINT --task=TASK_NAME --fold=FOLD_NO --seed=SEED -m aws
```
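
For example, if a row in `results.csv` reported AutoGluon with the `2023Q2` framework definition on task `TASK_NAME` of suite `openml/s/271` under constraint `1h8c_gp3`, with fold 3 and seed 12345, the corresponding rerun would look as follows (all values here are hypothetical; take them from your own results file):

```commandline
python runbenchmark.py autogluon:2023Q2 openml/s/271 1h8c_gp3 --task=TASK_NAME --fold=3 --seed=12345 -m aws
```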

??? note "In the future..."

    We want to improve support for rerunning experiments defined in a `results.csv` file.
    This is useful not just for reproducibility, but also for rerunning failed jobs or scheduling a job matrix.
    However, we currently do not have the resources to add this feature. We welcome contributions.

# Reproducing the JMLR Paper
When opting for a *balanced* evaluation, it should be sufficient to use benchmark version *2.1.7*.
Version *2.1.0* denotes the last minor benchmark update; all subsequent releases are bugfixes to address issues that came up during evaluation.
Since most bugs only affected specific tasks or frameworks, only the jobs affected by a given bugfix were rerun with the later patch release.

Install version *2.1.7* and evaluate each framework on a 32GiB, 8 vCPU machine with the command:

```commandline
python runbenchmark.py FRAMEWORK:2023Q2 STUDY CONSTRAINT -m docker
```
Note that `FRAMEWORK` here is any of the predefined configurations in `resources/frameworks_2023Q2.yaml`,
the `CONSTRAINT`s used were `1h8c_gp3` and `4h8c_gp3`, and `STUDY` was either `openml/s/269` or `openml/s/271`.
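
For instance, assuming `autogluon` is one of the definitions in `resources/frameworks_2023Q2.yaml`, evaluating it on the classification suite under the four-hour constraint would be:

```commandline
python runbenchmark.py autogluon:2023Q2 openml/s/271 4h8c_gp3 -m docker
```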

We used AWS `m5.2xlarge` instances for this (using `-m aws` instead of `-m docker`), which also allowed us to parallelize the evaluations (`-p` parameter).
If you plan to use AWS, make sure to update your configuration accordingly, so that you use the right AMLB version and instance type.
See the [AWS guide](aws.md) for how to get started with the AMLB on AWS, and be warned that this can get expensive quickly.
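
A sketch of such an AWS run with, say, ten parallel jobs (assuming a configuration file along the lines of the example below) is:

```commandline
python runbenchmark.py FRAMEWORK:2023Q2 STUDY CONSTRAINT -m aws -p 10
```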

??? abstract "Example of a custom configuration file"

    ```yaml title="~/.config/automlbenchmark/config.yaml"
    # put this file in your ~/.config/automlbenchmark directory
    # to override default configs
    ---
    project_repository: https://github.com/openml/automlbenchmark#v2.1.7

    benchmarks:                     # configuration namespace for the benchmark definitions.
      definition_dir:               # list of directories containing the benchmark yaml definitions.
        - '{root}/resources/benchmarks'
        - '{user}/benchmarks'
      constraints_file:
        - '{user}/constraints.yaml'
        - '{root}/resources/constraints.yaml'

    versions:
      python: 3.9

    aws:
      use_docker: true

      iam:
        temporary: false            # set to true if you want IAM entities (credentials used by ec2 instances) to be recreated for each benchmark run.
        credentials_propagation_waiting_time_secs: 360  # increase this waiting time if you encounter credentials issues on ec2 instances when using temporary IAM.

      s3:
        bucket: NAME-OF-BUCKET-WE-USED  # e.g. automl-benchmark-697442f1; ALWAYS SET this bucket name, as it needs to be unique in the entire S3 domain
                                        # (40 chars max as the app reserves some chars for temporary buckets).
                                        # If you prefer using temporary s3 buckets (see below), you can comment out this property.
        temporary: false            # set to true if you want a new s3 bucket to be temporarily created/deleted for each benchmark run.

      ec2:
        terminate_instances: always # see resources/config.yaml for explanations: you may want to switch this value to `success` if you want to investigate benchmark failures.
        spot:
          enabled: true
          max_hourly_price: 0.20    # comment out to use the default
        monitoring:
          cpu:
            query_interval_seconds: 900
            abort_inactive_instances: false
        regions:
          eu-north-1:
            ami: ami-0989fb15ce71ba39e

      resource_files:               # this allows custom config + frameworks to be automatically uploaded to the running instance (benchmark files are always uploaded).
        - '{user}/config.yaml'
        - '{user}/frameworks.yaml'
        - '{user}/extensions'
        - '{user}/benchmarks'
        - '{user}/constraints.yaml'

      job_scheduler:
        retry_on_errors:            # Boto3 errors that will trigger a job reschedule.
          - 'SpotMaxPriceTooLow'
          - 'MaxSpotInstanceCountExceeded'
          - 'InsufficientFreeAddressesInSubnet'
          - 'InsufficientInstanceCapacity'
          - 'RequestLimitExceeded'
          - 'VolumeLimitExceeded'
          - 'VcpuLimitExceeded'
        retry_on_states:            # EC2 instance states that will trigger a job reschedule.
          - 'Server.SpotInstanceShutdown'
          - 'Server.SpotInstanceTermination'
          - 'Server.InsufficientInstanceCapacity'
          - 'Client.VolumeLimitExceeded'
          - 'VcpuLimitExceeded'

    job_scheduler:
      max_parallel_jobs: 1000
      delay_between_jobs: 8

    inference_time_measurements:
      enabled: false
      additional_job_time: 1800

    frameworks:                     # configuration namespace for the framework definitions.
      definition_file:              # list of yaml files describing the framework base definitions.
        - '{root}/resources/frameworks.yaml'
        - '{user}/frameworks.yaml'
    ```

When planning to reproduce the experiments with the *strict* steps, you will need to reference the `results.csv` file as described in "[Strict Reproducibility](#strict-reproducibility)".
While evaluating frameworks for the JMLR paper, we frequently found minor issues with both the AutoML benchmark and the AutoML frameworks.
We generally made a best effort to resolve these issues, which resulted in multiple patch releases to, for example, update a framework definition or update an integration script to account for changes in a new release of an AutoML framework.
Unfortunately, this makes a *strict* reproduction of the experiments even harder, as different releases have to be used for different parts of the experiments.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -21,6 +21,7 @@ nav:
- Configuration: using/configuration.md
- using/aws.md
- using/result_analysis.md
- using/reproducing.md
- Extending the Benchmark:
- extending/index.md
- extending/benchmark.md