Reproducability of old runs (Legacy Server?) #12

PGijsbers · 2024-09-02T14:31:03Z

Goal:
For reproducibility, it would be great if we could recompute any run. For instance, a run with an old scikit-learn version.

Challenges
This is a complex problem. It might, for instance, be the case that the scikit-learn wheel is not available anymore, or not for more modern python versions. Old versions might also have serious security vulnerabilities

Options

Setting up Legacy Servers, ran by OpenML, containing snapshots of the system for any run.
Making it possible to run a Legacy Server locally. This way, you can compute old runs on your own computer
Only safe all the metadata, without offering the possibility to run it. If you want to recompute an old run, we give you all the information on what you would need. Recreating the environment is your own responsibility. This would safe OpenML a lot of complexities.

The text was updated successfully, but these errors were encountered:

joaquinvanschoren · 2024-09-06T12:54:16Z

At this moment, I think we can only really promise option 3 (which is fair, I think).

Option 1, running many legacy servers won't be maintainable.

For Option 2, simple dockerfiles won't work. We'll need to containerize (using Docker or Singularity) the entire environment, including compiled binaries. That would make sense if we have a regular release cycle of the entire server, and that would really make more sense after we've moved everything to Python? Then we can auto-generate these containers with every major release, and link runs to an OpenML release, but this is not low-hanging fruit.

Thoughts?

PGijsbers · 2024-09-06T14:59:10Z

As I understand it, aren't we very close to option 2? @josvandervelde dockerized all server services. So all we would need to do is to tag current versions. Then, combined with older openml-python docker images, we mostly make good on the promise. At least as far as scikit-learn based runs go.
Insofar as there are runs which cannot be reproduced with that, I would fall back to option 3: make sure the metadata stays available (which as far as I am concerned was never really put into contention anyway).

joaquinvanschoren · 2024-09-06T15:40:49Z

Maybe. We would also need to store compiled containers, right? With wheels breaking and all that.

PGijsbers · 2024-09-08T16:40:16Z

The images (should) come with fully pre-installed environments. I am not entirely sure what you are referring to.

joaquinvanschoren · 2024-09-08T20:10:13Z

Then that should be fine. If there are multiple version of dependent libraries (eg. Scikitlearn, torch,...) between OpenML releases, would that still work?

PGijsbers · 2024-09-10T09:25:02Z

Gotcha. That's indeed an issue. Likely packages will stay available on PyPI. But we can build an images for (sklearn x openml) tuples (or other dependencies), to cover it a bit better. I am not sure where we should draw the line in that case.

josvandervelde added this to Roadmap (high level) Mar 6, 2023

PGijsbers assigned joaquinvanschoren Sep 2, 2024

PGijsbers converted this from a draft issue Sep 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproducability of old runs (Legacy Server?) #12

Reproducability of old runs (Legacy Server?) #12

PGijsbers commented Sep 2, 2024

joaquinvanschoren commented Sep 6, 2024

PGijsbers commented Sep 6, 2024

joaquinvanschoren commented Sep 6, 2024 •

edited

Loading

PGijsbers commented Sep 8, 2024

joaquinvanschoren commented Sep 8, 2024

PGijsbers commented Sep 10, 2024

Reproducability of old runs (Legacy Server?) #12

Reproducability of old runs (Legacy Server?) #12

Comments

PGijsbers commented Sep 2, 2024

joaquinvanschoren commented Sep 6, 2024

PGijsbers commented Sep 6, 2024

joaquinvanschoren commented Sep 6, 2024 • edited Loading

PGijsbers commented Sep 8, 2024

joaquinvanschoren commented Sep 8, 2024

PGijsbers commented Sep 10, 2024

joaquinvanschoren commented Sep 6, 2024 •

edited

Loading