Authors: Manoj Kumar, Mostafa Dehghani, Neil Houlsby.
The authors propose Dual PatchNorm (DPN): two additional LayerNorm layers, one placed before and one after the patch embedding layer of Vision Transformers.
They set out to improve ViT models. First, they tried different placements of LayerNorm inside the Transformer blocks, but this did not help: they found that the standard pre-LN strategy in ViT is already close to optimal.
However, they observed that placing additional LayerNorms before and after the standard ViT patch-projection layer, which they call Dual PatchNorm (DPN), improves significantly over well-tuned ViT baselines.
Experiments on three different datasets demonstrate the efficacy of DPN. The authors also observed that the learned LayerNorm scale parameters upweight the pixels at the center and corners of each patch. In code (Flax-style), the DPN patch embedding looks like this:
import einops
import flax.linen as nn

# Flatten the image into non-overlapping patches of size (hp, wp).
hp, wp = patch_size[0], patch_size[1]
x = einops.rearrange(
    x, "b (ht hp) (wt wp) c -> b (ht wt) (hp wp c)", hp=hp, wp=wp)
x = nn.LayerNorm(name="ln0")(x)                 # LayerNorm before the patch embedding
x = nn.Dense(output_features, name="dense")(x)  # linear patch projection
x = nn.LayerNorm(name="ln1")(x)                 # LayerNorm after the patch embedding
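A minimal sketch that wraps the snippet above into a runnable Flax module and applies it to a dummy batch; the DualPatchNormStem class, the shapes, and the hyperparameter values are my own illustration, not taken from the paper.

import jax
import jax.numpy as jnp
import einops
import flax.linen as nn

class DualPatchNormStem(nn.Module):
  # Hypothetical wrapper around the paper's DPN patch-embedding snippet.
  patch_size: tuple     # (hp, wp)
  output_features: int  # embedding dimension D

  @nn.compact
  def __call__(self, x):
    hp, wp = self.patch_size
    # (B, H, W, C) -> (B, N, hp*wp*C): flatten non-overlapping patches.
    x = einops.rearrange(
        x, "b (ht hp) (wt wp) c -> b (ht wt) (hp wp c)", hp=hp, wp=wp)
    x = nn.LayerNorm(name="ln0")(x)   # LN before the projection
    x = nn.Dense(self.output_features, name="dense")(x)
    x = nn.LayerNorm(name="ln1")(x)   # LN after the projection
    return x

stem = DualPatchNormStem(patch_size=(16, 16), output_features=192)
dummy = jnp.zeros((2, 224, 224, 3))               # batch of 2 images
params = stem.init(jax.random.PRNGKey(0), dummy)
tokens = stem.apply(params, dummy)                # shape (2, 196, 192)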
Vision Transformer consists of a patch embedding layer (PE) followed by a stack of Transformer blocks. The PE layer first rearranges the image $x \in \mathbb{R}^{H \times W \times 3}$ into a sequence of $N = HW/P^2$ flattened patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot 3)}$, where $P$ is the patch size, and then projects each patch independently with a dense layer into a sequence of visual tokens $x_t \in \mathbb{R}^{N \times D}$.
Given a sequence of $N$ patch embeddings $X \in \mathbb{R}^{N \times D}$, LayerNorm applies two operations to each embedding $x_i$:

$$\hat{x}_{i,j} = \frac{x_{i,j} - \mu_i}{\sigma_i} \quad (1)$$

$$y_i = \gamma \odot \hat{x}_i + \beta \quad (2)$$

where $\mu_i = \frac{1}{D}\sum_{j} x_{i,j}$, $\sigma_i = \sqrt{\frac{1}{D}\sum_{j} (x_{i,j} - \mu_i)^2}$, and $\gamma, \beta \in \mathbb{R}^{D}$ are learnable parameters shared across all patch embeddings.

The first equation normalizes each patch embedding independently across the embedding dimension to zero mean and unit standard deviation; the second applies the learnable scale and shift.
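As a quick sanity check of equations (1) and (2), here is a minimal JAX sketch of LayerNorm over the embedding dimension; the function name, the epsilon value, and the toy input are my own, not from the paper.

import jax.numpy as jnp

def layer_norm(x, gamma, beta, eps=1e-6):
  # x: (N, D) patch embeddings; gamma, beta: (D,) learnable scale and shift.
  mu = x.mean(axis=-1, keepdims=True)                  # per-embedding mean
  sigma = jnp.sqrt(((x - mu) ** 2).mean(axis=-1, keepdims=True) + eps)
  x_hat = (x - mu) / sigma                             # Eq. (1): normalize
  return gamma * x_hat + beta                          # Eq. (2): scale and shift

x = jnp.arange(12.0).reshape(3, 4)       # 3 embeddings of dimension D=4
y = layer_norm(x, gamma=jnp.ones(4), beta=jnp.zeros(4))
print(y.mean(axis=-1), y.std(axis=-1))   # approximately 0 and 1 per row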
ViTs incorporate LayerNorm before every self-attention and MLP layer, commonly known as the pre-LN strategy. For each of the self-attention and MLP layers, the authors evaluated three placements: LayerNorm before (pre-LN), after (post-LN), or both before and after (pre+post-LN), leading to nine different combinations.
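A minimal sketch of the three placements around a generic sublayer (attention or MLP); the residual wiring shown here is my assumption of the conventional pre-LN/post-LN forms, and the parameter-free ln helper and tanh stand-in are illustrative only.

import jax.numpy as jnp

def ln(x):  # parameter-free LayerNorm for illustration (gamma=1, beta=0)
  mu = x.mean(-1, keepdims=True)
  return (x - mu) / jnp.sqrt(((x - mu) ** 2).mean(-1, keepdims=True) + 1e-6)

def pre_ln_block(x, sublayer):       # ViT default: normalize the sublayer input
  return x + sublayer(ln(x))

def post_ln_block(x, sublayer):      # normalize after the residual addition
  return ln(x + sublayer(x))

def pre_post_ln_block(x, sublayer):  # normalize both before and after
  return ln(x + sublayer(ln(x)))

x = jnp.ones((2, 5, 8))              # (batch, tokens, dim)
mlp = lambda h: jnp.tanh(h)          # stand-in for self-attention / MLP
print(pre_ln_block(x, mlp).shape)    # (2, 5, 8)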
Instead of adding LayerNorms inside the Transformer blocks, the authors propose applying LayerNorms in the stem alone, both before and after the patch embedding layer. In particular, they replace the standard patch embedding $x = \mathrm{PE}(p)$ with $x = \mathrm{LN}(\mathrm{PE}(\mathrm{LN}(p)))$, exactly as in the code snippet above.
The authors also assess three alternate strategies: Pre, Post and Post PosEmb. Pre applies LayerNorm only to the inputs of the patch embedding, Post only to its outputs, and Post PosEmb to the outputs after they are summed with the positional embeddings.
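A sketch of these ablated stem variants next to DPN, written as plain functions; patch_embed and pos_emb are hypothetical stand-ins, and where the positional embeddings enter for Pre, Post and DPN is my assumption (the paper's text only pins this down for Post PosEmb).

import jax.numpy as jnp

def ln(x):  # parameter-free LayerNorm, as in the previous sketch
  mu = x.mean(-1, keepdims=True)
  return (x - mu) / jnp.sqrt(((x - mu) ** 2).mean(-1, keepdims=True) + 1e-6)

def stem_pre(p, patch_embed, pos_emb):          # LN on the embedding inputs only
  return patch_embed(ln(p)) + pos_emb

def stem_post(p, patch_embed, pos_emb):         # LN on the embedding outputs only
  return ln(patch_embed(p)) + pos_emb

def stem_post_posemb(p, patch_embed, pos_emb):  # LN after adding positional embeddings
  return ln(patch_embed(p) + pos_emb)

def stem_dpn(p, patch_embed, pos_emb):          # Dual PatchNorm: LN before and after
  return ln(patch_embed(ln(p))) + pos_emb

p = jnp.ones((2, 196, 768))                       # (batch, patches, patch_dim)
patch_embed = lambda h: h @ jnp.ones((768, 192))  # stand-in linear projection
pos_emb = jnp.zeros((196, 192))
print(stem_dpn(p, patch_embed, pos_emb).shape)    # (2, 196, 192)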
The authors' ablation table shows the accuracy changes for these alternate strategies: Pre is unstable on B/32, leading to a significant drop in accuracy, and it also obtains minor drops in accuracy on S/32 and Ti/16.
Post and Post PosEmb achieve worse performance on the smaller ViT variants, indicating that normalizing both the inputs and the outputs of the embedding layer is necessary to obtain consistent improvements in accuracy across all ViT variants.