Training stop/resume and checkpointing #173

Open
vvmnnnkv opened this issue Aug 10, 2020 · 0 comments
Labels
Type: Improvement 📈 Performance improvement not introducing a new feature or requiring a major refactor Type: New Feature ➕ Introduction of a completely new addition to the codebase
@vvmnnnkv
Member

Feature Description

With the Training API (#172) in place, we can add the ability to stop training and save the intermediate training state so that training can be resumed later.

// Start the training
// Training object would contain current epoch, batch, modelParameters
training = Job.train(...)

Suggested API:

// Stop training
training.stop()

New events in Job.train: 'stop'
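
To illustrate how `stop()` and a 'stop' event could fit together, here is a minimal sketch. It does not use the real Training API: `TrainingSketch`, its 'step' event, and the event payloads are hypothetical names chosen for the example, and the loop body is a stand-in for an actual batch.

```javascript
// Hypothetical sketch: none of these names are confirmed syft.js API.
class TrainingSketch {
  constructor(totalSteps) {
    this.totalSteps = totalSteps;
    this.step = 0;
    this.stopRequested = false;
    this.handlers = {};
  }

  // Register a handler for an event name ('step' or 'stop' in this sketch).
  on(event, handler) {
    (this.handlers[event] = this.handlers[event] || []).push(handler);
    return this;
  }

  emit(event, payload) {
    (this.handlers[event] || []).forEach((h) => h(payload));
  }

  // Only sets a flag; the loop checks it before starting the next step.
  stop() {
    this.stopRequested = true;
  }

  // Returns true if training ran to completion, false if it was stopped.
  run() {
    while (this.step < this.totalSteps) {
      if (this.stopRequested) {
        this.emit('stop', { step: this.step });
        return false;
      }
      this.step += 1; // stand-in for processing one batch
      this.emit('step', { step: this.step });
    }
    return true;
  }
}

const training = new TrainingSketch(10);
let stoppedAt = null;
training.on('step', ({ step }) => {
  if (step === 3) training.stop();
});
training.on('stop', ({ step }) => {
  stoppedAt = step;
});
const finished = training.run();
```

Checking the flag between steps means a stop request never interrupts a batch halfway, so the state captured at the 'stop' event is always consistent.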

// User-defined serialization (serialize/unserialize/storage is up to user)
serialized_checkpoint = serialize(training)
unserialized_checkpoint = unserialize(serialized_checkpoint)
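
Since serialization is left to the user, one possible approach is a JSON round trip. The sketch below assumes the training object exposes `epoch`, `batch`, and `modelParameters` as mentioned above, and additionally assumes that `modelParameters` is an array of `Float32Array` values; that shape is an assumption made for the example, not something the Training API specifies.

```javascript
// Hypothetical helpers; the issue leaves serialization entirely up to the user.
// Assumes modelParameters is an array of Float32Array (an illustrative choice).
function serialize(training) {
  return JSON.stringify({
    epoch: training.epoch,
    batch: training.batch,
    // Typed arrays do not serialize as JSON arrays, so convert them first.
    modelParameters: training.modelParameters.map((p) => Array.from(p)),
  });
}

function unserialize(serialized) {
  const data = JSON.parse(serialized);
  return {
    epoch: data.epoch,
    batch: data.batch,
    modelParameters: data.modelParameters.map((p) => Float32Array.from(p)),
  };
}

// Example round trip with made-up values.
const exampleTraining = {
  epoch: 2,
  batch: 17,
  modelParameters: [new Float32Array([0.5, -1.25]), new Float32Array([3])],
};
const restored = unserialize(serialize(exampleTraining));
```

JSON is convenient for storage in places like localStorage, but it is verbose for large models; a binary format may be a better fit there.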

// Supplying checkpoint back to Job.train
training = Job.train(trainingPlan, {
   ...
   checkpoint: unserialized_checkpoint
})

The training loop should read the checkpoint's properties
and load the model params, epoch, step, batchSize, etc. from it.
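
As a rough illustration of resuming from a checkpoint, the sketch below starts the loop at the checkpoint's position instead of at zero. `runTraining` and its options are hypothetical, and it assumes `checkpoint.step` is the index of the next step to run within `checkpoint.epoch`; the real loop would also restore model params and batchSize.

```javascript
// Hypothetical sketch of a loop that honors a checkpoint; not the actual Job.train.
// Assumes checkpoint.step is the next step to execute within checkpoint.epoch.
function runTraining({ epochs, stepsPerEpoch, checkpoint = null }) {
  const startEpoch = checkpoint ? checkpoint.epoch : 0;
  const startStep = checkpoint ? checkpoint.step : 0;
  const visited = [];

  for (let epoch = startEpoch; epoch < epochs; epoch++) {
    // Only the first resumed epoch starts mid-way; later epochs start at step 0.
    const firstStep = epoch === startEpoch ? startStep : 0;
    for (let step = firstStep; step < stepsPerEpoch; step++) {
      visited.push([epoch, step]); // stand-in for one real training step
    }
  }
  return visited;
}

const resumed = runTraining({
  epochs: 2,
  stepsPerEpoch: 3,
  checkpoint: { epoch: 0, step: 2 },
});
// resumed is [[0, 2], [1, 0], [1, 1], [1, 2]]
```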

What alternatives have you considered?

The API was discussed within the FL team.

Additional Context

See #172

@vvmnnnkv vvmnnnkv added Type: New Feature ➕ Introduction of a completely new addition to the codebase Type: Improvement 📈 Performance improvement not introducing a new feature or requiring a major refactor labels Aug 10, 2020
@vvmnnnkv vvmnnnkv added this to the 0.3.0 milestone Aug 10, 2020

2 participants