Training stop/resume and checkpointing #173
Labels
Type: Improvement 📈
Performance improvement not introducing a new feature or requiring a major refactor
Type: New Feature ➕
Introduction of a completely new addition to the codebase
Feature Description
With the Training API (#172) in place, we can add the ability to stop training and save the intermediate training state so that training can be resumed later.
Suggested API:
New events in Job.train:
'stop'
The training loop should read the properties of the checkpoint
and load the model params, epoch, step, batchSize, etc. from it.
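To make the stop/resume flow concrete, here is a minimal sketch of a loop that produces a checkpoint when asked to stop and picks up from one when resumed. It is only an illustration: `train`, `makeCheckpoint`, `shouldStop`, `onStop`, and the checkpoint shape are hypothetical names and not the Training API from #172, and plain callbacks stand in for the proposed 'stop' event.

```javascript
// Hypothetical sketch: none of these names come from the actual Training API (#172).

// Snapshot the training state into a plain checkpoint object.
function makeCheckpoint(state) {
  return {
    epoch: state.epoch,
    step: state.step,
    batchSize: state.batchSize,
    modelParams: state.modelParams.slice(), // copy, so later updates don't mutate it
  };
}

// Run training, resuming from `options.checkpoint` if one is given.
// `shouldStop` is polled after every step and stands in for a stop request;
// `onStop` receives the checkpoint and stands in for a 'stop' event listener.
function train(options) {
  const { epochs, stepsPerEpoch, checkpoint, shouldStop, onStop } = options;
  const state = checkpoint
    ? makeCheckpoint(checkpoint) // resume: epoch, step, batchSize, params from checkpoint
    : { epoch: 0, step: 0, batchSize: options.batchSize, modelParams: options.modelParams.slice() };

  while (state.epoch < epochs) {
    while (state.step < stepsPerEpoch) {
      // Placeholder for a real optimizer step on one batch.
      state.modelParams = state.modelParams.map((p) => p + 1);
      state.step += 1;
      if (shouldStop(state)) {
        onStop(makeCheckpoint(state));
        return null; // training was interrupted
      }
    }
    state.epoch += 1;
    state.step = 0;
  }
  return state.modelParams; // training finished
}

// Usage: stop after the 4th step overall, then resume from the saved checkpoint.
const base = { epochs: 2, stepsPerEpoch: 3, batchSize: 32, modelParams: [0, 0] };
let saved = null;
let stepsSeen = 0;
const first = train({
  ...base,
  shouldStop: () => ++stepsSeen === 4,
  onStop: (c) => { saved = c; },
});
const resumed = train({
  ...base,
  checkpoint: saved,
  shouldStop: () => false,
  onStop: () => {},
});
```

In this sketch the resumed run ends with the same parameters an uninterrupted run would produce, because the checkpoint records both the epoch and the step within it. A real implementation would also need to serialize the checkpoint and restore any optimizer state.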
What alternatives have you considered?
The API was discussed within the FL team.
Additional Context
See #172