Resume model training from checkpoint #127

Open
rusheb opened this issue Mar 25, 2023 · 2 comments
Labels: code-quality (code quality improvement)

Comments

@rusheb
Collaborator

rusheb commented Mar 25, 2023

Description

Implement the capability to resume model training from a saved checkpoint. This should include adding an API for resuming a model from a checkpoint, which could be either downloaded from Weights & Biases (w&b) or stored locally. We may be able to leverage the w&b cache.

Not sure exactly when this will be required. Is it just when a training run crashes?

Acceptance Criteria

  • Add an API for resuming model training from a checkpoint
  • Support checkpointed models downloaded from w&b
  • Support checkpointed models stored locally
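
A rough sketch of what such an API could look like, to make the criteria above concrete. Everything named here (`Checkpoint`, `save_checkpoint`, `load_checkpoint`, `resolve_checkpoint`, `remaining_steps`, and the `checkpoint.json` file name) is hypothetical and not taken from the codebase. JSON stands in for whatever serialization the project actually uses, and the w&b branch assumes the checkpoint was logged as an artifact containing that one file. The only w&b calls used are `wandb.Api().artifact(...)` and `Artifact.download(...)`.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

# Hypothetical file name; a real project would pick its own convention.
CHECKPOINT_FILENAME = "checkpoint.json"


@dataclass
class Checkpoint:
    """Hypothetical container for the state needed to continue a run."""
    step: int               # last completed training step
    model_state: dict       # stand-in for model weights
    optimizer_state: dict   # stand-in for optimizer state


def save_checkpoint(ckpt: Checkpoint, path: Path) -> None:
    """Write a checkpoint to disk as JSON."""
    Path(path).write_text(json.dumps(asdict(ckpt)))


def load_checkpoint(path: Path) -> Checkpoint:
    """Read a checkpoint previously written by save_checkpoint."""
    return Checkpoint(**json.loads(Path(path).read_text()))


def resolve_checkpoint(source: str, download_dir: Optional[str] = None) -> Path:
    """Turn a local path or a w&b artifact name into a local checkpoint file.

    Local paths take priority. Anything that does not exist on disk is
    assumed to be a w&b artifact name such as "entity/project/name:version".
    """
    path = Path(source)
    if path.is_file():
        return path
    if path.is_dir():
        return path / CHECKPOINT_FILENAME

    # Imported lazily so local-only use does not require wandb.
    import wandb

    artifact = wandb.Api().artifact(source)
    # download() returns the directory the artifact files were placed in;
    # repeated downloads can be served from the w&b cache.
    root = Path(artifact.download(root=download_dir))
    return root / CHECKPOINT_FILENAME


def remaining_steps(ckpt: Checkpoint, total_steps: int) -> range:
    """Steps still to run, starting after the last completed one."""
    return range(ckpt.step + 1, total_steps)
```

Resuming after a crash would then be roughly `ckpt = load_checkpoint(resolve_checkpoint(source))`, followed by restoring model and optimizer state and looping over `remaining_steps(ckpt, total_steps)`. Having a single `source` argument cover both cases keeps the calling code identical for local and w&b checkpoints.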
@traeuker
Member

Happy to descope it as @afspies probably has the hackiest way to do this himself (joke), and others probably don't train models where this is worth implementing(?) Not sure, though.
@mivanit you want to eye this in the future?

@mivanit
Member

mivanit commented Jun 28, 2023

Probably 90% of the work required for this is done, so I think I might as well implement it.

@mivanit mivanit added the code-quality code quality improvement label Sep 4, 2023