# Fine-tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia

Mohammad Fahes¹, Tuan-Hung Vu¹ ², Andrei Bursuc¹ ², Patrick Pérez³, Raoul de Charette¹

¹ Inria, Paris, France.

² valeo.ai, Paris, France.

³ Kyutai, Paris, France.

TL;DR: CLIP projects visual embeddings into the shared latent space with a single linear projection layer. We show that simply fine-tuning this layer can be a strong alternative to linear probing, prompt tuning, and CLIP-adapters, and that it also performs well in test-time adaptation.

Stay tuned for the code!

Paper: https://arxiv.org/abs/2410.05270

## ProLIP

We fine-tune the pretrained linear projection layer of the vision encoder, with a regularization loss that keeps its weights close to their pretrained values.
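The idea can be sketched numerically. The toy example below (NumPy rather than the paper's PyTorch/CLIP setup; dimensions, temperature-free logits, λ, and learning rate are all illustrative, and image-side feature normalization is omitted to keep the analytic gradient simple) fine-tunes only the visual projection matrix `W` with a cross-entropy loss plus an ℓ2 penalty pulling `W` back toward the pretrained `W0`:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_cls, n = 64, 32, 5, 40            # toy dimensions, not CLIP's real ones

W0 = rng.normal(scale=0.1, size=(d_in, d_out))   # "pretrained" visual projection (frozen target)
W = W0.copy()                                    # the only trainable parameter

feats = rng.normal(size=(n, d_in))               # frozen penultimate visual features
T = rng.normal(size=(n_cls, d_out))              # frozen class text embeddings
T /= np.linalg.norm(T, axis=1, keepdims=True)
y = rng.integers(0, n_cls, size=n)               # few-shot labels
Y = np.eye(n_cls)[y]                             # one-hot labels

lam, lr = 0.5, 0.01                              # illustrative regularization weight / step size
first_loss = None
for step in range(300):
    logits = (feats @ W) @ T.T                   # image-text similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability for softmax
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    ce = -np.log(P[np.arange(n), y] + 1e-12).mean()
    reg = lam * np.sum((W - W0) ** 2)            # pull W back toward the pretrained weights
    if first_loss is None:
        first_loss = ce + reg
    # analytic gradient of (cross-entropy + l2 regularizer) w.r.t. W
    grad = feats.T @ ((P - Y) @ T) / n + 2 * lam * (W - W0)
    W -= lr * grad

final_loss = ce + reg
```

Running this, `final_loss` drops below `first_loss` while `W` stays close to `W0`; the regularizer is what keeps the few-shot update from drifting far from the pretrained solution.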

## Citation

```bibtex
@article{fahes2024fine,
  title={Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia},
  author={Fahes, Mohammad and Vu, Tuan-Hung and Bursuc, Andrei and P{\'e}rez, Patrick and de Charette, Raoul},
  journal={arXiv preprint arXiv:2410.05270},
  year={2024}
}
```