"where γs denotes a scaling factor initialized as zeros" and "where γc is the channel-wise scaling factor initialized as zeros." Excuse me, how are the scaling factors γs and γc set in the specific experiments?
Hi, @zzr123000, thanks for your insightful question. The initialization of the scaling factors follows LayerScale (or ResScale), where the branch carrying the scaling factors is effectively "removed" at initialization so that the shortcut branch acts as the identity. This can benefit network optimization when the module has complex branches. We have not considered different initializations of the scaling factors for general or task-specific cases; zero initialization for pre-training, followed by fine-tuning from the trained model, likely already covers most cases.
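For illustration, here is a minimal, hypothetical LayerScale-style sketch of what zero-initialized channel-wise scaling looks like in PyTorch (the class name `LayerScaleBranch` and the wrapped branch are assumptions for this example, not the repo's actual implementation):

```python
import torch
import torch.nn as nn


class LayerScaleBranch(nn.Module):
    """Wraps a branch with a learnable channel-wise scale (gamma_c)
    initialized to zeros, so the block starts as an identity mapping."""

    def __init__(self, dim: int, branch: nn.Module, init_value: float = 0.0):
        super().__init__()
        self.branch = branch
        # channel-wise scaling factor, initialized to zeros by default
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # shortcut + scaled branch; with gamma = 0 the output equals x at init
        return x + self.gamma * self.branch(x)


if __name__ == "__main__":
    dim = 64
    block = LayerScaleBranch(dim, nn.Sequential(nn.Linear(dim, dim), nn.GELU()))
    x = torch.randn(2, 16, dim)  # (batch, tokens, channels)
    out = block(x)
    print(torch.allclose(out, x))  # True: zero-initialized gamma keeps identity
```

With gamma at zero, the residual block contributes nothing initially and the scaling factors are learned during training along with the rest of the parameters.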