Implement the capability to resume model training from a saved checkpoint. This should include an API for resuming a model from a checkpoint that is either downloaded from Weights & Biases (W&B) or stored locally. We may be able to leverage the W&B cache.
Not sure exactly when this will be required. Is it just when a training run crashes?
Acceptance Criteria
Add an API for resuming model training from a checkpoint
Support checkpointed models downloaded from W&B
Support checkpointed models stored locally
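
One possible shape for the entry point, as a rough sketch rather than a proposed final design: a single function that takes a checkpoint source and returns a local path, treating the source as a local file if it exists and as a W&B artifact name otherwise. The names `resolve_checkpoint`, `_download_wandb_artifact`, and the `downloader` parameter are hypothetical. The W&B branch assumes checkpoints are logged as artifacts and uses `wandb.Api().artifact(...).download()`, which should reuse files already in the W&B artifact cache.

```python
from pathlib import Path
from typing import Callable, Optional


def _download_wandb_artifact(artifact_name: str) -> Path:
    """Download a W&B artifact and return the local directory it lands in.

    Assumes checkpoints are logged as W&B artifacts, named like
    "entity/project/name:version".
    """
    import wandb  # imported lazily so local-only use does not require wandb

    artifact = wandb.Api().artifact(artifact_name)
    return Path(artifact.download())


def resolve_checkpoint(
    source: str,
    downloader: Optional[Callable[[str], Path]] = None,
) -> Path:
    """Return a local path for a checkpoint (hypothetical helper).

    If `source` is an existing local path it is returned as-is; otherwise
    it is treated as a W&B artifact name and fetched with `downloader`
    (defaults to the W&B download above).
    """
    local = Path(source).expanduser()
    if local.exists():
        return local
    fetch = downloader or _download_wandb_artifact
    return fetch(source)
```

The returned path would then be handed to whatever loader the training code uses to restore model and optimizer state. Keeping the download step injectable also makes the local branch testable without a W&B login.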
Happy to descope this: @afspies probably already has his own hacky way of doing it (joke), and others probably don't train models where this is worth implementing(?). Not sure, though. @mivanit, do you want to look at this in the future?