SGD Block
The SGD configuration block controls the behavior of the SGD (Stochastic Gradient Descent) algorithm in CNTK. If you are familiar with other toolkits, be sure to check out:
- How is the minibatch size defined in CNTK?
- How to convert learning rate and momentum parameters from other toolkits?
The SGD configuration block has the following structure and default values:
SGD = {
    # Training process control
    modelPath = ...
    trainCriterionNodeName = ...
    evalCriterionNodeName = ...
    maxEpochs = ...
    epochSize = 0
    minibatchSize = 256
    truncated = false
    dropoutRate = 0
    maxTempMemSizeInSamplesForCNN = 0
    keepCheckPointFiles = false
    disableWkInBatchNormal = false

    # Learning rate and momentum control
    learningRatesPerSample = ...
    learningRatesPerMB = ...
    minLearningRatePerSample = ...
    momentumAsTimeConstant = ...
    momentumPerMB = ...
    useNAG = false

    autoAdjust = {
        autoAdjustLR = "none"  # | "searchBeforeEpoch" | "adjustAfterEpoch"
        autoAdjustMinibatch = false

        # for autoAdjustLR = "adjustAfterEpoch":
        reduceLearnRateIfImproveLessThan = 0
        learnRateDecreaseFactor = 0.618
        increaseLearnRateIfImproveMoreThan = (infinity)
        learnRateIncreaseFactor = 1.382
        loadBestModel = true
        learnRateAdjustInterval = 1
        useCVSetControlLRIfCVExists = true
        useEvalCriterionControlLR = false

        # for autoAdjustLR = "searchBeforeEpoch":
        numMiniBatch4LRSearch = 500
        numPrevLearnRates = 5
        numBestSearchEpoch = 1

        # for autoAdjustMinibatch = true:
        numMiniBatch4LRSearch = 500
        minibatchSizeTuningFrequency = 1
        minibatchSizeTuningMax = 1048576
        minibatchSearchCriterionErrorMargin = 1
    }

    parallelTrain = {
        parallelizationMethod = "none"  # | "dataParallelSGD" | "blockMomentumSGD" | "modelAveragingSGD"
        parallelizationStartEpoch = 1
        distributedMBReading = false
        syncPerfStats = 0

        # for parallelizationMethod = "dataParallelSGD"
        dataParallelSGD = {
            gradientBits = (8*sizeof(precision))  # | 1 | 2
            useBufferedAsyncGradientAggregation = false
        }

        # for parallelizationMethod = "blockMomentumSGD"
        blockMomentumSGD = {
            blockSize = (120000 * #workers)
            blockMomentumAsTimeConstant = (-blockSize / log(1 - 1/#workers))
            resetSGDMomentum = true;
            useNesterovMomentum = true;
            blockLearningRate = 1.0
        }

        # for parallelizationMethod = "modelAveragingSGD"
        modelAveragingSGD = {
            blockSize = (40000 * #workers)
        }
    }

    # Gradient control
    gradientClippingWithTruncation = true
    clippingThresholdPerSample = (infinity)
    L2RegWeight = 0
    L1RegWeight = 0
    gaussianNoiseInjectStd = 0
    gradUpdateType = ""  # "" | "adagrad" | "rmsProp" | "fsAdaGrad"

    # for gradUpdateType = "adaGrad" or "rmsProp":
    normWithAveMultiplier = true

    # for gradUpdateType = "rmsProp":
    rms_wgt_inc = 1.2
    rms_wgt_dec = 0.75
    rms_wgt_max = 10.0
    rms_wgt_min = 0.1
    rms_gamma = 0.99

    # Information display
    traceLevel = 0
    firstMBsToShowResult = 10
    numMBsToShowResult = 10
    numMBsToCUDAProfile = 0

    # Precompute
    useAllDataForPreComputedNode = true

    # Gradient check
    gradientCheck = false
    sigFigs = 6
}
- trainCriterionNodeName: the name of the training criterion node. If not provided, the default training criterion node in the network is used.
- evalCriterionNodeName: the name of the evaluation criterion node. If not provided, the default evaluation criterion node in the network is used.
- epochSize: the number of samples (tensors along a dynamic axis) in each epoch. The epoch size in CNTK is the number of samples after which specific additional actions are taken, including:
  - saving a checkpoint model (training can be restarted from here)
  - cross-validation
  - learning-rate control
  - minibatch-scaling

  For smaller data sets, epochSize is often set equal to the data set size. You can specify 0 to denote that. For large data sets, you may want to guide your choice of epochSize by checkpointing. For example, if you want to lose at most 30 minutes of computation in case of a power outage or network glitch, you would want a checkpoint (from which the training can be resumed) to be created about every 30 minutes. Choose epochSize to be the number of samples that takes about 30 minutes to compute.

  Note: As with minibatchSize, if your input data consists of sequences, such as sentences of text, epochSize is measured in individual items, not in the number of sequences.
- keepCheckPointFiles: whether to keep the checkpoint file after a new epoch starts. Valid values are true and false (default).
- disableWkInBatchNormal: whether to enable the weight decay term of batch normalization during SGD updates. Valid values are true and false (default).
- maxEpochs: the maximum number of epochs to run.
- minibatchSize: the minibatch size for each epoch, given in samples (tensors along a dynamic axis). The default value is 256. You can use different values for different epochs, e.g., 128*2:1024 means using a minibatch size of 128 for the first two epochs and then 1024 for the rest. Note that 'minibatch size' in CNTK means the number of samples processed between model updates. This definition also holds when parallelizing across workers (e.g. for K workers, the number of samples each worker would process is minibatchSize/K). In the case of variable-length inputs, minibatchSize refers to the number of items in these sequences, not the number of sequences. SGD will fit as many sequences as possible into the minibatch without exceeding minibatchSize total samples. If several inputs are given, tensors are added to the current minibatch until one of the inputs exceeds minibatchSize.
- dropoutRate: the dropout rate during the training procedure. Default is 0.0. You can use syntax such as 0.5*10:0.2, which means using a dropout rate of 0.5 for 10 epochs and then 0.2 for the rest.
- maxTempMemSizeInSamplesForCNN: the maximum temporary memory used (in number of samples) when packaging and unpackaging input features. Default is 0, which means using any value as needed. Useful for controlling the memory footprint, especially when running on a GPU.
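The per-epoch schedule syntax used by minibatchSize and dropoutRate above (for example 128*2:1024) can be illustrated with a small Python sketch. This is not CNTK's parser; the function name expand_schedule is hypothetical, and the sketch only assumes the behavior described in the text: value*count repeats a value for count epochs, a bare value counts for one epoch, and the last value applies to all remaining epochs.

```python
def expand_schedule(spec: str, num_epochs: int) -> list[float]:
    """Expand a 'value*count:value*count:value' schedule into one value per epoch."""
    values: list[float] = []
    for part in spec.split(":"):
        if "*" in part:
            value_text, count_text = part.split("*")
            value, count = float(value_text), int(count_text)
        else:
            # A bare value is assumed to cover a single epoch.
            value, count = float(part), 1
        values.extend([value] * count)
    # The last value applies to all remaining epochs.
    while len(values) < num_epochs:
        values.append(values[-1])
    return values[:num_epochs]


# Minibatch size 128 for the first two epochs, then 1024 for the rest.
print(expand_schedule("128*2:1024", 5))
```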
Note: CNTK's way of specifying learning rates and momentum differs from other toolkits. See "How to convert learning rate and momentum parameters from other toolkits?" for a detailed description.
- learningRatesPerSample: the learning rates per epoch with which each sample's gradient updates the model. You can use different values for different epochs, e.g., 0.025*10:0.00625 means using the learning rate 0.025 for the first 10 epochs and then 0.00625 for the rest. This is the preferred way of specifying learning rates in CNTK, since it is agnostic to the minibatch size, which is important when automatic minibatch-sizing is used. Other toolkits often specify learning rates in a minibatch-averaging fashion. To convert from that notation, use learning rate per sample = learning rate per MB / minibatchSize.
- learningRatesPerMB: an alternative way of specifying learning rates, to be applied to the average over samples in the minibatch. This is the most common way of specifying learning rates in other toolkits, but it is problematic in CNTK, where data-parallel training modifies the minibatch size. Internally, this is converted into learningRatesPerSample by dividing the values by the specified minibatchSize. Mutually exclusive with learningRatesPerSample.
- minLearningRatePerSample: the minimum learning rate per sample. When the learning rate per sample is smaller than this value, the training process terminates. This is often used to control early stopping when automatic learning rate adjustment is enabled. Default is 1e-9.
- momentumAsTimeConstant: similarly to learningRatesPerSample, CNTK specifies momentum in a minibatch-size agnostic way, as the time constant (in samples) of a unit-gain first-order IIR filter. The value specifies the number of samples after which a gradient has an effect of 1/e = 37%. Other toolkits often specify momentum as a per-minibatch weight (e.g. 0.9). To convert from that, use momentumAsTimeConstant = -minibatchSize / ln (momentumPerMB). You can use syntax such as 20000*10:2500, which means using the momentum time constant 20000 for 10 epochs and then 2500 for the rest.
- momentumPerMB: this alternative way of specifying momentum mimics the behavior of common toolkits. E.g. specifying 0.9 means that the previous gradient is retained with a weight of 0.9. Note, however, that unlike some other toolkits, CNTK still uses a unit-gain filter, i.e. the new gradient is multiplied by (1-momentumPerMB). Internally, this is converted into momentumAsTimeConstant = -minibatchSize / ln (momentumPerMB).
- autoAdjust: contains the settings for automatic learning rate control. The default value is empty (“”), which means no automatic learning rate control. Inside the block, there can be the following values:
  - autoAdjustLR: the automatic learning rate adjustment algorithm to use. Valid values are None (default, don’t auto-adjust the learning rate), AdjustAfterEpoch (check the training criterion after each epoch using the development set of the training set and decide whether to adjust the learning rate), and SearchBeforeEpoch (search the learning rate based on a small portion of the training set before each epoch starts).
When used in the AdjustAfterEpoch mode:
- reduceLearnRateIfImproveLessThan: reduce the learning rate if the improvement is less than this value. Default is 0.
- learnRateDecreaseFactor: the learning rate decrease factor. Default value is 0.618.
- increaseLearnRateIfImproveMoreThan: increase the learning rate if the improvement is larger than this value. Default value is 1#INF (infinity), which means never increase.
- learnRateIncreaseFactor: the learning rate increase factor. Default value is 1.382.
- loadBestModel: whether to load the best model if the current model decreases the performance. Valid values are true (default) and false.
- learnRateAdjustInterval: determines how often the learning rate adjustment check is applied. Default is 1 epoch. If this is set to a value larger than 1, the learning rate adjustment is based on the average criterion computed from the last learnRateAdjustInterval epochs.
- useEvalCriterionControlLR: use the evaluation criterion instead of the training criterion to control the learning rate. Default is false.
When used in the SearchBeforeEpoch mode:
- numMiniBatch4LRSearch: the number of minibatches used to search the learning rate. Default value is 500. It is typically set to 10-20% of the total minibatches in an epoch.
- numPrevLearnRates: the number of previous learning rates used as a hint for the search range. Default value is 5.
- numBestSearchEpoch: the number of epochs in which the best learning rate is used instead of the sufficient learning rate. Default value is 1.
When used in the 'AdaptiveMinibatchSizing' mode:
- numMiniBatch4LRSearch: the number of minibatches used to search the minibatch size when in adaptive minibatch size mode. Default value is 500. It is typically set to 10-20% of the total minibatches in an epoch. This parameter is shared with the learning rate search in SearchBeforeEpoch mode.
- autoAdjustMinibatch: enables or disables adaptive adjustment of the minibatch size. Default value is false. Adaptive minibatch sizing begins in the first epoch after the explicitly specified minibatch sizes have been used up. For example, if the user specified minibatchSize=256:1024, then 256 and 1024 are used in the first 2 epochs and adaptive minibatch sizing is used afterwards.
- minibatchSizeTuningFrequency: the number of epochs to skip, on a periodic basis, before dynamically adjusting the minibatch size. Default value is 1.
- minibatchSizeTuningMax: the maximum size allowed for an adaptively adjusted minibatch size. Default value is 1048576.
- gradientClippingWithTruncation: whether to use truncation-based gradient clipping to control gradient explosion. Valid values are true (default) and false. If it is false, norm-based clipping is used instead, which is more expensive.
- clippingThresholdPerSample: the clipping threshold for each sample. Default value is 1#INF, which means infinity (i.e., clipping is turned off).
- L2RegWeight (default 0): the L2 regularization weight per sample. The Frobenius norm of the learnable parameter is added to the objective with this weight. This is specified per sample, meaning that the Frobenius norm is multiplied by the number of samples in the minibatch.
- L1RegWeight (default 0): the L1 regularization weight per sample.
- gradUpdateType: the gradient update type. Valid values are None (default, no special treatment of the gradient), AdaGrad, and RmsProp.
  - When gradUpdateType equals AdaGrad or RmsProp, you can control the behavior of the gradient update using the following parameter:
    - normWithAveMultiplier: normalize the gradient with the average multipliers applied to the gradients by the AdaGrad/RmsProp algorithm. Default is true.
  - When gradUpdateType equals RmsProp, you can control the behavior of the gradient update using the following parameters:
    - rms_wgt_inc: multiplicative increment of the learning rate scale. Default is 1.2.
    - rms_wgt_dec: multiplicative decrement of the learning rate scale. Default is 0.75.
    - rms_wgt_max: maximum learning rate scale allowed. A value closer to 1 makes the learning rate adjustment more stable but slower. Default is 10.
    - rms_wgt_min: minimum learning rate scale allowed. A value closer to 1 makes the learning rate adjustment more stable but slower. Default is 0.1.
    - rms_gamma: smoothing factor used to estimate the moving average of the variance. The smaller the value, the quicker it forgets past information. Default is 0.99.
- gaussianNoiseInjectStd: the standard deviation of the Gaussian noise added when using the AdaGrad approach. Default is 0.
- traceLevel: the trace level, which decides what information is printed to stderr. Valid values are 0 (default) and 1.
- numMBsToShowResult: the number of minibatches after which training statistics are displayed. Default is 10.
- gradientCheck: determines whether to use the gradient checker. The default value is false. When using the gradient checker, you need a minibatch size that is larger than the sequence length for RNNs, because of the truncated backpropagation through time (BPTT) algorithm used to train RNNs, and a smaller learning rate to prevent numerical issues caused by divergence. In addition, precision should be set to double.
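As background for the gradientCheck option, the following Python sketch shows the general idea of a numerical gradient check: an analytic gradient is compared with a finite-difference estimate. It is a generic illustration, not CNTK's checker. The helper names are hypothetical, and interpreting a sigFigs-style setting as "number of significant digits that must agree" is an assumption made for this sketch. Python floats are double precision, which matches the advice above to use double.

```python
def numeric_gradient(f, x: float, eps: float = 1e-6) -> float:
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def agrees_to_sig_figs(a: float, b: float, sig_figs: int) -> bool:
    """True if a and b agree to roughly sig_figs significant digits (assumed criterion)."""
    if a == b:
        return True
    scale = max(abs(a), abs(b))
    return abs(a - b) <= scale * 10.0 ** (-sig_figs)


def cube(x: float) -> float:
    return x ** 3


def cube_gradient(x: float) -> float:
    # Analytic derivative of x**3, the quantity being checked.
    return 3.0 * x ** 2


estimate = numeric_gradient(cube, 2.0)
print(agrees_to_sig_figs(estimate, cube_gradient(2.0), sig_figs=6))
```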
The behavior of the SGD algorithm (Stochastic Gradient Descent Learner) is controlled by the SGD block of the options. When an option is omitted, its default value is used.
CNTK has a very specific definition of the minibatchSize parameter: it denotes the number of samples between model updates.
A sample here is defined as one vector or tensor flowing through the system.
For example, in an image recognition task, one image is one sample.
Importantly, for sequential data, a sample is an individual item of a sequence.
Hence, CNTK's minibatchSize does not refer to the number of sequences in the minibatch,
but to the aggregate number of sequence items/tokens across the sequences that constitute the minibatch.
CNTK has native support for variable-length sequences, i.e. it can accommodate
sequences of highly varying lengths within the same minibatch, without the need for workarounds like bucketing.
Together with CNTK's notion of specifying the learning rate per sample (instead of a minibatch average),
every item of sequences of any length contributes the same to the gradient,
leading to consistent convergence.
(Many other toolkits define the minibatch size for sequential data as the number of sequences
in the minibatch.
This is problematic, especially if gradients are also defined as minibatch averages rather than
CNTK's minibatch sums, because the contribution to the gradient from each token or step in a sequence
would be inversely proportional to the sequence length. CNTK's approach avoids this.)
When multiple Input{}s are used, it is possible that not all inputs have the same sequence length.
For example, in sequence classification the label of each sequence is a single token.
In this case, the input with the largest number of samples controls the minibatch size.
Despite our clear definition of minibatchSize being the number of samples between model updates,
there are two occasions where we must relax the definition:
- sequential data: Variable-length sequences do not generally sum up to exactly the requested minibatch size. In this case, as many sequences as possible are packed into a minibatch without exceeding the requested minibatch size (with one exception: if the next sequence in the randomized corpus is longer than the requested minibatch size, the minibatch will consist of this sequence alone).
- data parallelism: Here, the minibatch size is approximate, as our chunk-based randomization algorithm cannot guarantee that each worker receives precisely the same number of samples.
All of the above considerations also apply to epochSize.
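The packing rule for sequential data described above can be sketched in a few lines of Python. This is a simplified illustration, not CNTK's reader: the function name pack_minibatches is hypothetical, sequences are represented only by their lengths, and randomization and parallelism are ignored.

```python
def pack_minibatches(sequence_lengths: list[int], minibatch_size: int) -> list[list[int]]:
    """Greedily group sequences so that each minibatch stays within minibatch_size samples.

    A sequence longer than minibatch_size forms a minibatch on its own.
    """
    batches: list[list[int]] = []
    current: list[int] = []
    current_total = 0
    for length in sequence_lengths:
        # Close the current minibatch if adding this sequence would exceed the limit.
        if current and current_total + length > minibatch_size:
            batches.append(current)
            current, current_total = [], 0
        current.append(length)
        current_total += length
    if current:
        batches.append(current)
    return batches


# The sequence of length 200 exceeds the requested size and is placed alone.
print(pack_minibatches([30, 50, 40, 200, 10, 20], minibatch_size=100))
```

Note that the resulting minibatches contain 80, 40, 200 and 30 samples, which shows why the actual minibatch size fluctuates around the requested value.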
CNTK's model-update formulae differ somewhat from some other toolkits and from literature, in that in CNTK, the parameters are specified in a way that is agnostic of the minibatch size. This is important in the context of data-parallel training, where CNTK itself may modify the minibatch size. Specifying learning rate and momentum in an agnostic way avoids complexities of adjusting these values upon changes of minibatch size.
These are CNTK's model-update formulae for SGD with momentum:
G(t) = (1-mu) sum { g(t-minibatchSize+1) ... g(t) } + mu * G(t-minibatchSize)
mu = exp (-minibatchSize/momentumAsTimeConstant)
M(t) = M(t-minibatchSize) + learningRatePerSample G(t)
with
- G(t): momentum-smoothed gradient after t samples
- g(t'): raw gradient of the sample at time t'
- M(t): model used after seeing t samples
- t incrementing in steps of minibatchSize

(Note: When using variable-length sequences, minibatchSize will fluctuate slightly, since sequence lengths
in a minibatch generally do not sum up precisely to the requested minibatchSize.)
Note that:
- The momentum filter G(t) is unit-gain. Every sample's gradient is distributed over time such that the contributions sum to 1.
- The learning rate is specified per sample, rather than w.r.t. an average over samples.
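The unit-gain property can be checked numerically. The Python sketch below implements the update formulae from this section for a scalar model; the function names are hypothetical and the code is an illustration, not CNTK's implementation. It feeds a single unit gradient followed by zeros and adds up its total effect on the model, which should approach 1 times the learning rate.

```python
import math


def momentum_per_minibatch(minibatch_size: int, time_constant: float) -> float:
    """mu = exp(-minibatchSize / momentumAsTimeConstant), as in the text."""
    return math.exp(-minibatch_size / time_constant)


def sgd_momentum_step(model, smoothed, sample_gradients, lr_per_sample, mu):
    """One model update from the per-sample gradients of one minibatch."""
    gradient_sum = sum(sample_gradients)  # a sum over samples, not an average
    smoothed = (1.0 - mu) * gradient_sum + mu * smoothed
    model = model + lr_per_sample * smoothed  # sign convention as in the text
    return model, smoothed


def total_effect_of_one_gradient(mu: float, num_updates: int) -> float:
    """Feed a single unit gradient, then zeros, and return the accumulated model change."""
    model, smoothed = 0.0, 0.0
    for step in range(num_updates):
        gradients = [1.0] if step == 0 else [0.0]
        model, smoothed = sgd_momentum_step(model, smoothed, gradients, 1.0, mu)
    return model


mu = momentum_per_minibatch(minibatch_size=128, time_constant=1200)
print(mu, total_effect_of_one_gradient(mu, num_updates=500))
```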
The specification used in other toolkits and neural-network literature is often this:
G'(t) = average { g(t-minibatchSize+1) ... g(t) } + mu * G'(t-minibatchSize)
M(t) = M(t-minibatchSize) + eta G'(t)
with
- G'(t): gradient defined in the alternative way, as a per-minibatch average and without (1-mu)
- mu: momentum parameter, e.g. 0.9, of a non-unit-gain IIR filter, applied per minibatch
- eta: learning rate with minibatch-average gradient
Parameters specified in this way can be mapped to CNTK parameters using these formulas:
learningRatePerSample = eta / minibatchSize / (1-mu)
momentumAsTimeConstant = -minibatchSize / ln (mu)
You will get close to this by using learningRatePerMB and momentumPerMB, which are mapped as follows (notice the absence of / (1-mu) for learningRatePerSample):
learningRatePerSample = learningRatePerMB / minibatchSize
momentumAsTimeConstant = -minibatchSize / ln (momentumPerMB)
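The two mappings above can be written as small Python helpers. The function names are hypothetical; the arithmetic is taken directly from the formulas in this section. The example shows that, for mu = 0.9, the exact mapping gives a per-sample learning rate ten times larger than the learningRatePerMB mapping, because of the missing / (1-mu) factor.

```python
import math


def to_cntk_parameters(eta: float, mu: float, minibatch_size: int) -> tuple[float, float]:
    """Exact mapping from (eta, mu) of the non-unit-gain formulation to CNTK parameters."""
    lr_per_sample = eta / minibatch_size / (1.0 - mu)
    time_constant = -minibatch_size / math.log(mu)
    return lr_per_sample, time_constant


def from_per_mb_settings(lr_per_mb: float, momentum_per_mb: float,
                         minibatch_size: int) -> tuple[float, float]:
    """Mapping applied to learningRatePerMB and momentumPerMB (no 1/(1-mu) factor)."""
    lr_per_sample = lr_per_mb / minibatch_size
    time_constant = -minibatch_size / math.log(momentum_per_mb)
    return lr_per_sample, time_constant


exact = to_cntk_parameters(eta=0.1, mu=0.9, minibatch_size=256)
approximate = from_per_mb_settings(lr_per_mb=0.1, momentum_per_mb=0.9, minibatch_size=256)
print("exact:      ", exact)
print("approximate:", approximate)
```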
Configuration used by the ImageHandsOn tutorial with data parallelism and automatic minibatch scaling:
SGD = {
    epochSize = 50000
    maxEpochs = 160 ; minibatchSize = 128
    learningRatesPerSample = 0.0078125*80:0.00078125*40:0.000078125
    momentumAsTimeConstant = 1200
    L2RegWeight = 0.0001

    firstMBsToShowResult = 10 ; numMBsToShowResult = 500

    parallelTrain = {
        parallelizationMethod = "dataParallelSGD"
        parallelizationStartEpoch = 1
        distributedMBReading = true
        dataParallelSGD = { gradientBits = 2 }
    }
    autoAdjust = {
        autoAdjustMinibatch = true        # enable automatic growing of minibatch size
        minibatchSizeTuningFrequency = 10 # try to enlarge after this many epochs
        numMiniBatch4LRSearch = 200
        minibatchSizeTuningMax = 15000    # out of memory above this
    }
}
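To relate this configuration to the per-minibatch notation of other toolkits, the following Python sketch inverts the mappings given earlier (learningRatePerSample = learningRatePerMB / minibatchSize and momentumAsTimeConstant = -minibatchSize / ln (momentumPerMB)). The function name is hypothetical. It uses the first-phase learning rate 0.0078125 and the time constant 1200 from the configuration, and evaluates them at the initial minibatch size of 128, at the tuning maximum of 15000, and at 1024, an arbitrary intermediate size chosen for illustration.

```python
import math


def equivalent_per_minibatch_values(lr_per_sample: float, time_constant: float,
                                    minibatch_size: int) -> tuple[float, float]:
    """Per-minibatch learning rate and momentum equivalent to the per-sample settings."""
    lr_per_mb = lr_per_sample * minibatch_size
    momentum_per_mb = math.exp(-minibatch_size / time_constant)
    return lr_per_mb, momentum_per_mb


for size in (128, 1024, 15000):
    lr_mb, momentum_mb = equivalent_per_minibatch_values(0.0078125, 1200, size)
    print(f"minibatchSize={size}: learning rate per MB={lr_mb:g}, "
          f"momentum per MB={momentum_mb:.4g}")
```

The per-sample settings stay fixed while the equivalent per-minibatch values change with the minibatch size, which is why the minibatch-size agnostic specification is convenient when the minibatch size is adjusted automatically.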