CLIP

An implementation of OpenAI's CLIP model, using a ViT (Vision Transformer) as the image encoder and a BERT-like transformer encoder as the text encoder.

CLIP was one of the earliest widely adopted vision-language models and paved the way for the evolution of multimodal models. Before CLIP, most computer vision systems were built to classify a fixed set of categories, and using them in a new domain required retraining or fine-tuning the model on a domain-specific dataset. CLIP's key contribution was its ability to perform zero-shot transfer to new domains, which highlighted its robustness to data distribution shift.

Paper: Learning Transferable Visual Models From Natural Language Supervision

CLIP achieves this by combining language and vision in a single model architecture, as shown in the image below.

(Figure: CLIP architecture)

CLIP uses an image encoder and a text encoder to learn representations for images and text respectively. The two encoders are trained jointly with a contrastive loss that maximizes the cosine similarity between the embeddings of matching <image, text> pairs and minimizes it for non-matching pairs.

Below is the pseudocode from the paper describing the high-level working of CLIP:

(Figure: CLIP pseudocode)
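
The sketch below paraphrases that pseudocode as a runnable PyTorch function, assuming the two encoders have already produced per-sample feature vectors. The function name, argument names, and shapes are illustrative and are not the actual code in this repository.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy over the cosine similarities of a batch of
    n <image, text> pairs, as in the CLIP paper's pseudocode."""
    # L2-normalize so that dot products become cosine similarities.
    image_embeds = F.normalize(image_features, dim=-1)
    text_embeds = F.normalize(text_features, dim=-1)

    # Scaled pairwise cosine similarities: [n, n].
    logits_per_image = logit_scale.exp() * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # The matching pairs sit on the diagonal of the similarity matrix.
    labels = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2
```

At inference time only the similarity logits are needed: for zero-shot classification, an image's row of `logits_per_image` is softmaxed over the candidate text prompts.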

Implementation Details:

  1. Image Encoder - While the paper describes experiments with both the ResNet and ViT families of models, this implementation uses a ViT (Vision Transformer) as the image encoder.

  2. Text Encoder - A standard BERT-like transformer encoder is used for text encoding, except that, unlike BERT, CLIP's text encoder uses a causal mask in its self-attention mechanism (see the first sketch after this list).

  3. The weights of the model are initialized with the pre-trained weights from Hugging Face.

  4. The implementation doesn't include a training loop and can therefore only be used for inference.

  5. To confirm correctness, the repository contains validate.py, which tests this implementation against the CLIP model from Hugging Face's transformers library (see the second sketch after this list).
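
For item 2, the following is a minimal sketch of scaled dot-product self-attention with a causal mask, so that each text token can attend only to itself and earlier tokens. The function name and tensor shapes are illustrative, not the repository's actual code.

```python
import math
import torch

def causal_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a causal mask.
    q, k, v: [batch, heads, seq_len, head_dim] (illustrative shapes)."""
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))

    # Mask out upper-triangular positions (future tokens) with -inf before
    # the softmax, so they receive zero attention weight.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=q.device), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```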
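
For items 3-5, the sketch below shows how a validation script can compare a from-scratch model against Hugging Face's reference CLIP. It uses only the public transformers API; the image path and the commented-out MyCLIP calls are hypothetical placeholders, and validate.py in this repository is the authoritative check.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Reference model and processor from Hugging Face.
hf_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical example image
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    hf_out = hf_model(**inputs)

    # Hypothetical: run the from-scratch model initialized with the same
    # pre-trained weights, then compare its logits with the reference.
    # my_model = MyCLIP.from_hf_state_dict(hf_model.state_dict())
    # my_logits = my_model(inputs["pixel_values"], inputs["input_ids"],
    #                      inputs["attention_mask"])
    # assert torch.allclose(my_logits, hf_out.logits_per_image, atol=1e-4)

# Zero-shot class probabilities for the image over the candidate prompts.
print(hf_out.logits_per_image.softmax(dim=-1))
```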
