scaling factors γs and γc #26

zzr123000 · 2024-10-31T04:34:59Z

The paper says that γs denotes a scaling factor initialized as zeros, and that γc is the channel-wise scaling factor initialized as zeros. How are the scaling factors γs and γc set in the actual experiments?

Lupin1998 · 2025-01-16T19:17:13Z

Hi @zzr123000, thanks for your insightful question. The initialization of the scaling factors is taken from LayerScale (or ResScale): with zero initialization, the branch carrying the scaling factor is effectively "removed" at the start of training, so the block keeps only the shortcut branch and acts as the identity. This can benefit network optimization when the module has complex branches. We haven't explored other initializations of the scaling factors, either for the general case or for specific cases. Zero initialization for pre-training, followed by fine-tuning from the trained model, probably already covers most cases.
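To make the effect of zero initialization concrete, here is a minimal, framework-free sketch of a residual branch with a channel-wise scaling factor, `y = x + gamma * f(x)`. It is not the repository's code: `scaled_residual`, `toy_branch`, and the numeric values are hypothetical names and data chosen only for illustration, and in a real model `gamma` would be a learnable parameter updated during training.

```python
# Minimal sketch (assumption: not the repository's implementation).
# Residual branch with a channel-wise scaling factor: y = x + gamma * f(x).

def scaled_residual(x, branch, gamma):
    """Apply y[c] = x[c] + gamma[c] * branch(x)[c] for every channel c."""
    out = branch(x)
    return [x_c + g_c * o_c for x_c, g_c, o_c in zip(x, gamma, out)]


def toy_branch(x):
    """Hypothetical stand-in for an arbitrary, complex branch."""
    return [3.0 * v + 1.0 for v in x]


x = [0.5, -1.0, 2.0]             # one value per channel
gamma_zero = [0.0] * len(x)      # zero initialization of the scaling factor
y_init = scaled_residual(x, toy_branch, gamma_zero)

# Once gamma moves away from zero, the branch starts to contribute.
gamma_trained = [0.1, 0.0, -0.2]  # made-up values for illustration
y_trained = scaled_residual(x, toy_branch, gamma_trained)

print(y_init == x)  # the block is the identity at initialization
```

With `gamma` at zero the output equals the input regardless of what the branch computes, so the branch is gradually "switched on" only as training moves `gamma` away from zero.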

Lupin1998 self-assigned this Jan 16, 2025

Lupin1998 added the "question" label Jan 16, 2025

Lupin1998 mentioned this issue Jan 16, 2025

What do the two Subtract operations mean? #19

Closed
