
Training API #172

Open
vvmnnnkv opened this issue Aug 10, 2020 · 0 comments
Labels
Type: Improvement 📈 (Performance improvement not introducing a new feature or requiring a major refactor)
Type: New Feature ➕ (Introduction of a completely new addition to the codebase)
Milestone
0.3.0

Comments

@vvmnnnkv
Member

Feature Description

Add a useful API for the training cycle so that the user doesn't need to code the training loop from scratch each time.

Add new methods in Job:

  1. Job.request()
    Same as we currently do inside the .start() method (auth, download of model and plan).

  2. trainingProcess = Job.train(trainingPlan, parameters)
    Helper for the training loop (see the usage sketch after the spec definitions below).

trainingProcess - object containing the current epoch, batch, and modelParameters
trainingPlan - string
parameters - dict of values:

planInputs: list of PlanInputSpec
planOutputs: list of PlanOutputSpec
data: tensor
target: (optional) tensor
epochs: number - how many epochs to train
batchSize: number
stepsPerEpoch: (optional) number - max number of steps per epoch
events: list of handlers: 'start', 'end', 'epochStart', 'epochEnd', 'batchStart', 'batchEnd', 'error'

PlanInputSpec: object that describes a plan input argument

type: 'data' | 'target' | 'epoch' | 'batchSize' | 'step' | 'modelParameter' | 'value'
index: number
name: (optional) string
value: (optional) tensor

PlanOutputSpec: object that describes a plan output

type: 'loss' | 'metric' | 'modelParameter'
index: number
name: (optional) string
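
For illustration, a rough usage sketch of the proposed methods (the surrounding job object, the handler signatures, and names like dataTensor, targetTensor and lr are assumptions made for this sketch only; the "list of handlers" for events is shown here as an object keyed by event name):

// Hypothetical usage sketch; only Job.request() and Job.train() are proposed here.
// Authenticate and download the model and plans (what .start() does today).
await job.request()

// Drive the training loop via the helper.
trainingProcess = job.train('training_plan', {
  planInputs: [
    {type: 'data'}, {type: 'target'}, {type: 'batchSize'}, {type: 'value', value: lr},
    {type: 'modelParameter', index: 0}, {type: 'modelParameter', index: 1}
  ],
  planOutputs: [
    {type: 'loss'}, {type: 'metric'},
    {type: 'modelParameter', index: 0}, {type: 'modelParameter', index: 1}
  ],
  data: dataTensor,
  target: targetTensor,
  epochs: 5,
  batchSize: 64,
  events: {
    batchEnd: (epoch, batch, status) => console.log(epoch, batch, status.loss),
    error: (err) => console.error(err)
  }
})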

Pseudo code:
Training loop:

train(...):
  stepsPerEpoch = stepsPerEpoch || floor(len(data) / batchSize)
  trigger_event('start')
  modelParameters = job.model.parameters
  for (i = 0; i < epochs; i++) {
    trigger_event('epochStart', (i))
    for (j = 0; j < stepsPerEpoch; j++) {
      trigger_event('batchStart', (i, j))
      // fetch the j-th batch for this step
      x, y = get_batch(data, batchSize, j), get_batch(target, batchSize, j)
      // resolve input specs against the current loop state
      plan_args = resolve_inputs(planInputs,
        {
          modelParameter: modelParameters,
          data: x,
          target: y,
          epoch: i,
          batchSize: batchSize,
          step: j
        }
      )
      raw_outputs = job.plans[trainingPlan].execute(...plan_args)
      // map raw plan outputs back to named values via the output specs
      outputs = resolve_outputs(planOutputs, raw_outputs)
      status = {loss: outputs.loss, metric: outputs.metric}
      // updated parameters feed into the next step
      modelParameters = outputs.modelParameter
      trigger_event('batchEnd', (i, j, status))
    }
    trigger_event('epochEnd', (i))
  }
  trigger_event('end')

Resolving plan inputs/outputs from specs:

resolve_inputs(specs, vars) {
  args = []
  for (spec of specs) {
    if (spec.type == 'value') {
      // literal value supplied in the spec (e.g. learning rate)
      args.push(spec.value)
    } else if (spec.index != null) {
      // indexed variable, e.g. the i-th model parameter
      args.push(vars[spec.type][spec.index])
    } else {
      args.push(vars[spec.type])
    }
  }
  return args
}
     
resolve_outputs(specs, output) {
  out = {}
  i = 0
  for (spec of specs) {
    if (spec.index != null) {
      // collect indexed outputs (e.g. model parameters) into an array
      out[spec.type] = out[spec.type] || []
      out[spec.type][spec.index] = output[i]
    } else {
      out[spec.type] = output[i]
    }
    i++
  }
  return out
}

Example input/output specs for the MNIST training plan:

[{type: 'data'}, {type: 'target'}, {type: 'batchSize'}, {type: 'value', value: <lr>},
{type: 'modelParameter', index: 0}, {type: 'modelParameter', index: 1},
{type: 'modelParameter', index: 2}, {type: 'modelParameter', index: 3}]

[{type: 'loss'}, {type: 'metric'},
{type: 'modelParameter', index: 0}, {type: 'modelParameter', index: 1},
{type: 'modelParameter', index: 2}, {type: 'modelParameter', index: 3}]
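
Under these specs, one training step resolves roughly as follows (a sketch of the data flow only; x, y and lr are placeholders for the current batch and the learning rate value):

// Inputs: each spec is resolved positionally against the current loop variables.
plan_args = [x, y, batchSize, lr,
             modelParameters[0], modelParameters[1],
             modelParameters[2], modelParameters[3]]

raw_outputs = job.plans[trainingPlan].execute(...plan_args)

// Outputs: the i-th raw output is stored under its spec's type (and index).
outputs = {
  loss: raw_outputs[0],
  metric: raw_outputs[1],
  modelParameter: [raw_outputs[2], raw_outputs[3], raw_outputs[4], raw_outputs[5]]
}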

What alternatives have you considered?

The API was discussed within the FL team.

Additional Context

n/a

@vvmnnnkv vvmnnnkv added Type: New Feature ➕ Introduction of a completely new addition to the codebase Type: Improvement 📈 Performance improvement not introducing a new feature or requiring a major refactor labels Aug 10, 2020
@vvmnnnkv vvmnnnkv added this to the 0.3.0 milestone Aug 10, 2020