Question About Listed ViT Models in the configs/proj/flexivit/README.md #69
-
Hello! First of all, thank you very much for releasing so many helpful materials and code samples for the interesting FlexiViT work. When I went through the paper, the models referred to as ViT-B-16 and ViT-B-30 seemed to be the baseline ViT models trained with fixed patch sizes (16 and 30, respectively). Accordingly, their positional embedding grid sizes should be 15 and 8 if I am not mistaken (img_size divided by patch_size). The linked checkpoints do not appear to match these shapes, so I was curious whether the links map to the wrong models or whether I misunderstood the setup described in the paper for these models. Could you please help me with this matter?
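For reference, this is the quick arithmetic behind the grid sizes mentioned above (a small illustrative snippet; the 240×240 training resolution is an assumption taken from the paper, not stated in this thread):

```python
# Expected position-embedding grid sizes, assuming the paper's
# 240x240 training resolution (assumption; not stated in this thread).
img_size = 240
for patch_size in (16, 30):
    assert img_size % patch_size == 0
    print(patch_size, "->", img_size // patch_size)  # 16 -> 15, 30 -> 8
```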
-
Hi, thanks for your interest and the question!

You almost got it. For simplicity/uniformity of implementation, we also used the "underlying" patch and posemb sizes of 32 and 7 for the baseline models. Figures 17(b) and (c) in the appendix show that this change has absolutely no effect on the results, even for regular (non-flexi) ViT models.

So, for the patch embeddings, you can simply resize them to 16 and 30 at load time with PI-resize; for the position embeddings, resize them the usual way at load time, i.e. with (bi)linear interpolation. The code does both here: https://github.com/google-research/big_vision/blob/main/big_vision/models/proj/flexi/vit.py#L198-L206

To be clear, I did not go and double-check the checkpoints just now (though I believe I did check them when originally uploading), so do let me know if they somehow don't work.
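For concreteness, here is a minimal NumPy/JAX sketch of the two resize operations, not the big_vision implementation itself (see the linked vit.py for that). It assumes a `(p, p, c_in, c_out)` patch-embedding kernel and an `(h, w, d)` position-embedding grid; the function names are made up for illustration:

```python
import numpy as np
import jax.image


def pi_resize_patch_embed(kernel, new_p):
  """PI-resize a (p, p, c_in, c_out) patch-embedding kernel to new_p x new_p.

  Builds the linear map B that bilinearly resizes a p x p patch to
  new_p x new_p, then applies pinv(B^T) to the kernel, so that
  <resize(x), new_kernel> ~= <x, old_kernel> for any patch x.
  """
  p = kernel.shape[0]
  # Column i of B is the bilinearly resized i-th basis "image", flattened.
  eye = np.eye(p * p).reshape(p * p, p, p)
  resize = lambda img: np.asarray(
      jax.image.resize(img, (new_p, new_p), method="bilinear"))
  b = np.stack([resize(e).reshape(-1) for e in eye], axis=1)  # (new_p^2, p^2)
  pinv_bt = np.linalg.pinv(b.T)                               # (new_p^2, p^2)
  # Apply the same linear map to every (c_in, c_out) slice of the kernel.
  flat = kernel.reshape(p * p, -1)                            # (p^2, c_in*c_out)
  return (pinv_bt @ flat).reshape(new_p, new_p, *kernel.shape[2:])


def resize_posemb(posemb, new_hw):
  """Bilinearly resize an (h, w, d) position-embedding grid to new_hw.

  A class-token embedding, if any, would be split off and handled separately.
  """
  return jax.image.resize(
      posemb, (*new_hw, posemb.shape[-1]), method="bilinear")
```

So, under this sketch, `pi_resize_patch_embed(w, 16)` / `pi_resize_patch_embed(w, 30)` would map the stored 32×32 kernel to the two baseline patch sizes, and `resize_posemb(pe, (15, 15))` / `resize_posemb(pe, (8, 8))` the stored 7×7 grid.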
-
Hi, after carefully checking the relevant parts of the paper and the code portion you pointed to, it all makes sense to me now.