Authors: Zhuang Liu, Zhiqiu Xu, Joseph Jin, Zhiqiang Shen, and Trevor Darrell.
The authors demonstrate that dropout, usually applied as a regularizer against overfitting, can also mitigate underfitting when used at the start of training.
Dropout randomly deactivates each neuron with a fixed probability (the drop rate).
The paper presents an alternative use of dropout: tackling underfitting. The authors begin their investigation into dropout training dynamics with an observation on gradient norms, which leads to the key empirical findings. During the initial stages of training, dropout reduces gradient variance across mini-batches and allows the model to update in more consistent directions. These directions are also more closely aligned with the gradient direction of the entire dataset. Consequently, the model can optimize the training loss more effectively with respect to the whole training set, rather than being swayed by individual mini-batches. In other words, dropout counteracts the stochasticity of SGD and prevents excessive regularization caused by the randomness of mini-batch sampling during early training.
Based on this, the authors introduce early dropout: dropout that is used only during early training, to help underfitting models fit better.
Early dropout lowers the final training loss compared to both no dropout and standard dropout. For models that already use standard dropout, the authors also propose removing dropout during the early training epochs to mitigate overfitting. They refer to this approach as late dropout and demonstrate that it improves generalization accuracy for large models.
In this study, the authors compare two ViT-T/16 training runs on ImageNet: one without dropout as a baseline and the other with a 0.1 dropout rate.
Gradient Norm. The authors investigate the impact of dropout on the strength of gradients, measured by their norm.
Model Distance. Since the gradient steps are smaller, the dropout model is expected to travel a smaller distance from its initial point than the baseline model.
To measure the distance between the two models, authors use the
Gradient Direction Variance. The authors hypothesize that the dropout model produces more consistent gradient directions across mini-batches than the baseline. To test this, they collect a set of mini-batch gradients and compute the gradient direction variance (GDV) over that set:
[Equation: gradient direction variance (GDV)]
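The GDV formula itself is not reproduced in this summary. As a rough illustration only, the sketch below assumes that GDV is the average pairwise cosine distance between the collected mini-batch gradients, scaled to lie in [0, 1]. The function names, the 0.5 scaling, and the toy vectors are assumptions made for this example, not taken from the paper.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 0.5 * (1 - cos(u, v)): 0 for the same direction, 1 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 - dot / (norm_u * norm_v))


def gradient_direction_variance(grads):
    """Average cosine distance over all unordered pairs of gradients (assumed GDV)."""
    pairs = list(combinations(grads, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy "mini-batch gradients" (flattened); real ones would come from backprop.
consistent = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(gradient_direction_variance(consistent))
print(gradient_direction_variance(scattered))
```

Under this assumed definition, a lower value means the mini-batch gradients point in more consistent directions, which is the behaviour the authors attribute to the dropout model early in training.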
Gradient Direction Error. Consistent directions are only useful if they are also the right ones, so what is the correct direction to take? To fit the training data, the underlying objective is to minimize the loss on the entire training set, not just on any single mini-batch. The authors therefore compute the gradient of a given model on the whole training set, with dropout set to inference mode so that the gradient of the full model is captured. They then evaluate how far the actual mini-batch gradient is from this whole-dataset gradient.
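The comparison described above can be sketched as follows. This is a minimal illustration that assumes the gradient direction error is the average cosine distance from each mini-batch gradient to the whole-dataset gradient; the function names, the 0.5 scaling, and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 0.5 * (1 - cos(u, v)): 0 for the same direction, 1 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 - dot / (norm_u * norm_v))


def gradient_direction_error(batch_grads, full_grad):
    """Average cosine distance from each mini-batch gradient to the full gradient."""
    total = sum(cosine_distance(g, full_grad) for g in batch_grads)
    return total / len(batch_grads)


# Toy whole-dataset gradient and two hypothetical sets of mini-batch gradients.
full_grad = [1.0, 0.0]
aligned = [[2.0, 0.0], [0.5, 0.0]]
orthogonal = [[0.0, 1.0], [0.0, -3.0]]

print(gradient_direction_error(aligned, full_grad))
print(gradient_direction_error(orthogonal, full_grad))
```

Only direction matters in this measure: the aligned gradients have different magnitudes but the same direction as the whole-dataset gradient, so their error is zero.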
Bias-variance Tradeoff. This analysis of early training can be viewed through the lens of the bias-variance tradeoff. For no-dropout models, an SGD mini-batch provides an unbiased estimate of the whole-dataset gradient, because the expectation of the mini-batch gradient is equal to the whole-dataset gradient. With dropout, however, the estimate becomes somewhat biased: the mini-batch gradients are generated by different sub-networks, whose expected gradient may not match the full network's gradient. Nevertheless, the gradient variance is significantly reduced, which leads to a reduction in gradient error. Intuitively, this reduction in variance and error helps prevent the model from overfitting to specific batches, especially during the early stages of training, when the model is undergoing significant changes.
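The unbiasedness claim for the no-dropout case can be checked on a toy problem. The sketch below uses a one-parameter least-squares model and made-up data, neither of which appears in the paper. It shows that individual mini-batch gradients differ from the whole-dataset gradient while their average over equal-sized batches matches it. It does not model dropout.

```python
def grad(w, xs, ys):
    """Gradient with respect to w of the mean of 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up dataset and parameter value, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
w = 0.5
batch_size = 2

# Split the dataset into equal-sized, non-overlapping mini-batches.
batches = [
    (xs[i:i + batch_size], ys[i:i + batch_size])
    for i in range(0, len(xs), batch_size)
]

batch_grads = [grad(w, bx, by) for bx, by in batches]
mean_batch_grad = sum(batch_grads) / len(batch_grads)
full_grad = grad(w, xs, ys)

print(batch_grads)      # each mini-batch gradient differs (variance)
print(mean_batch_grad)  # their average matches the full gradient (no bias)
print(full_grad)
```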
The analysis above suggests that using dropout early can improve the model's ability to fit the training data.
Whether it is desirable to fit the training data better depends on whether the model is in an underfitting or an overfitting regime, which can be difficult to define precisely. The authors use the following criterion: if a model generalizes better with standard dropout, it is considered to be in an overfitting regime; if it performs better without dropout, it is considered to be in an underfitting regime.
In their default settings, models in the underfitting regime do not use dropout. To improve their ability to fit the training data, the authors propose early dropout: using dropout before a certain iteration and then disabling it for the rest of training. This reduces the final training loss and improves accuracy.
Models in the overfitting regime already include standard dropout in their training settings. During the early stages of training, dropout may unintentionally cause overfitting, which is not desirable. To reduce overfitting, the authors propose late dropout: not using dropout before a certain iteration and then using it for the rest of training. This approach is symmetric to early dropout.
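Both schedules reduce to a single switch point, which can be sketched as below. This is a minimal illustration with a hard on/off switch, as described above; the function name `drop_rate_at` and its parameters are hypothetical and not taken from the authors' code.

```python
def drop_rate_at(step, switch_step, rate, mode):
    """Return the dropout rate to apply at a given training step.

    mode="early": dropout is on before switch_step and off afterwards.
    mode="late":  dropout is off before switch_step and on afterwards.
    """
    if mode not in ("early", "late"):
        raise ValueError("mode must be 'early' or 'late'")
    before_switch = step < switch_step
    if mode == "early":
        return rate if before_switch else 0.0
    return 0.0 if before_switch else rate


# Hypothetical run: 1000 steps, switching at step 200, drop rate 0.1.
for step in (0, 199, 200, 999):
    early = drop_rate_at(step, 200, 0.1, "early")
    late = drop_rate_at(step, 200, 0.1, "late")
    print(step, early, late)
```

In a training loop, the returned value would be written to the dropout layers at the start of each step. How that is done depends on the framework and is left out here.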
1) Number of epochs to wait before turning dropout on or off: results show that performance is robust to this choice, which can vary from 1% to 50% of the total epochs.
2) Drop rate