Training stop/resume and checkpointing #173

vvmnnnkv · 2020-08-10T10:28:04Z

Feature Description

With the Training API (#172) in place, we can add the ability to stop training and save the intermediate training state so that training can be resumed later.

// Start the training
// Training object would contain current epoch, batch, modelParameters
training = Job.train(...)

Suggested API:

// Stop training
training.stop()

New event in Job.train: 'stop'
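As a rough illustration of how `stop()` and the 'stop' event could interact, here is a minimal sketch. It assumes the loop checks a flag at each batch boundary. `StoppableTraining`, its `on`/`emit` helpers, and the 'batchEnd' and 'end' event names are hypothetical and only stand in for whatever Job.train actually returns.

```javascript
// Hypothetical sketch: StoppableTraining is an illustrative name,
// not the object returned by Job.train in the library.
class StoppableTraining {
  constructor({ epochs, batchesPerEpoch }) {
    this.epochs = epochs;
    this.batchesPerEpoch = batchesPerEpoch;
    this.epoch = 0; // next epoch to run
    this.batch = 0; // next batch to run within that epoch
    this.stopRequested = false;
    this.handlers = {};
  }

  // Register a handler for an event name ('batchEnd', 'stop', 'end').
  on(event, handler) {
    this.handlers[event] = this.handlers[event] || [];
    this.handlers[event].push(handler);
    return this;
  }

  emit(event, payload) {
    (this.handlers[event] || []).forEach((handler) => handler(payload));
  }

  // Only sets a flag; the loop stops at the next batch boundary.
  stop() {
    this.stopRequested = true;
  }

  run() {
    this.stopRequested = false;
    for (; this.epoch < this.epochs; this.epoch++) {
      for (; this.batch < this.batchesPerEpoch; this.batch++) {
        if (this.stopRequested) {
          // epoch/batch point at the first batch that has NOT run yet.
          this.emit('stop', { epoch: this.epoch, batch: this.batch });
          return;
        }
        // A real implementation would execute the training plan here.
        this.emit('batchEnd', { epoch: this.epoch, batch: this.batch });
      }
      this.batch = 0;
    }
    this.emit('end', { epoch: this.epoch, batch: this.batch });
  }
}

// Usage: request a stop after the second batch of the first epoch.
const stoppable = new StoppableTraining({ epochs: 2, batchesPerEpoch: 3 });
let stoppedAt = null;
stoppable.on('batchEnd', ({ epoch, batch }) => {
  if (epoch === 0 && batch === 1) stoppable.stop();
});
stoppable.on('stop', (position) => {
  stoppedAt = position;
});
stoppable.run();
```

Because the object keeps its own `epoch` and `batch` counters, calling `run()` again continues from the position reported in the 'stop' event. A stop requested during the very last batch has no effect in this sketch, since training has already finished.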

// User-defined serialization (serialize/unserialize/storage is up to user)
serialized_checkpoint = serialize(training)
unserialized_checkpoint = unserialize(serialized_checkpoint)
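Since serialization is left to the user, one possible approach is plain JSON. The sketch below assumes the training object exposes `epoch`, `batch`, and `modelParameters`, and that `modelParameters` is an array of `Float32Array`. That layout is an assumption for illustration, not the library's actual format.

```javascript
// Hypothetical helpers; the issue leaves serialization up to the user.
// Assumes `training` exposes epoch, batch and modelParameters, where
// modelParameters is an array of Float32Array (an assumption).
function serialize(training) {
  return JSON.stringify({
    epoch: training.epoch,
    batch: training.batch,
    // JSON.stringify turns typed arrays into plain objects, so convert first.
    modelParameters: training.modelParameters.map((p) => Array.from(p)),
  });
}

function unserialize(serialized) {
  const raw = JSON.parse(serialized);
  return {
    epoch: raw.epoch,
    batch: raw.batch,
    // Restore each parameter list as a Float32Array.
    modelParameters: raw.modelParameters.map((p) => Float32Array.from(p)),
  };
}

// Usage with a stand-in for the training object.
const trainingState = {
  epoch: 3,
  batch: 7,
  modelParameters: [new Float32Array([0.5, -1.25]), new Float32Array([2])],
};
const serializedCheckpoint = serialize(trainingState);
const unserializedCheckpoint = unserialize(serializedCheckpoint);
```

The resulting string can be stored anywhere (local storage, a file, a server). For large models a binary format would likely be more compact than JSON.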

// Supplying checkpoint back to Job.train
training = Job.train(trainingPlan, {
   ...
   checkpoint: unserialized_checkpoint
})

The training loop should read the properties of the checkpoint
and load the model params, epoch, step, batchSize, etc. from it.
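The resume behaviour described above could look roughly like the following sketch. `runTrainingLoop`, the `onStep` callback, and the exact checkpoint field names (`epoch`, `step`, `batchSize`, `modelParams`) are hypothetical. The only point is that values found in the checkpoint take precedence over fresh defaults.

```javascript
// Hypothetical sketch of a resumable loop. `runTrainingLoop` and the
// `onStep` callback are illustrative names, not part of the library.
function runTrainingLoop({ epochs, stepsPerEpoch, batchSize, checkpoint, onStep }) {
  // Values from the checkpoint take precedence over fresh defaults.
  const state = {
    epoch: 0,
    step: 0,
    batchSize,
    modelParams: null,
    ...(checkpoint || {}),
  };

  for (let epoch = state.epoch; epoch < epochs; epoch++) {
    // Only the resumed epoch starts mid-way; later epochs start at step 0.
    const firstStep = epoch === state.epoch ? state.step : 0;
    for (let step = firstStep; step < stepsPerEpoch; step++) {
      // A real implementation would execute the training plan here.
      onStep({ epoch, step, batchSize: state.batchSize, modelParams: state.modelParams });
    }
  }
}

// Usage: resume from epoch 0, step 2 with a batch size stored in the checkpoint.
const visited = [];
const seenBatchSizes = new Set();
runTrainingLoop({
  epochs: 2,
  stepsPerEpoch: 3,
  batchSize: 32,
  checkpoint: { epoch: 0, step: 2, batchSize: 64, modelParams: [1, 2] },
  onStep: ({ epoch, step, batchSize }) => {
    visited.push(`${epoch}:${step}`);
    seenBatchSizes.add(batchSize);
  },
});
// visited is ["0:2", "1:0", "1:1", "1:2"]
```

Steps 0 and 1 of the first epoch are skipped because the checkpoint records them as done, and the stored `batchSize` of 64 replaces the default of 32.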

What alternatives have you considered?

The API was discussed within the FL team.

Additional Context

See #172

vvmnnnkv added the Type: New Feature ➕ and Type: Improvement 📈 labels Aug 10, 2020

vvmnnnkv added this to the 0.3.0 milestone Aug 10, 2020

vvmnnnkv assigned mjjimenez Aug 10, 2020

cereallarceny mentioned this issue Aug 19, 2020

Add a stopping method that stops the training process in iOS #40

Closed
