OSBC

This is the official implementation repository of the paper "Does Text Matter? Extending OCR with TrOCR and NLP for Image Classification and Retrieval".

OSBC (OCR Sentence BERT CLIP) is a novel architecture that extends CLIP with a text extraction pipeline composed of an OCR model (TrOCR or PyTesseract) and SBERT. OSBC leverages the text that appears inside an image as an additional feature for image classification and retrieval.
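As a rough illustration, the three branches can be sketched with off-the-shelf Hugging Face and Sentence-Transformers components. The SBERT checkpoint name, the input file, and the final fusion step below are illustrative assumptions; the actual implementation lives in this repo.

from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import (CLIPModel, CLIPProcessor, TrOCRProcessor,
                          VisionEncoderDecoderModel)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image

# OCR branch: read the text printed inside the image with TrOCR
ocr_processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
ocr_model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
pixels = ocr_processor(images=image, return_tensors="pt").pixel_values
inner_text = ocr_processor.batch_decode(ocr_model.generate(pixels),
                                        skip_special_tokens=True)[0]

# SBERT branch: embed the extracted text (checkpoint name is an assumption)
text_embedding = SentenceTransformer("all-MiniLM-L6-v2").encode(inner_text)

# CLIP branch: embed the image itself
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
image_embedding = clip_model.get_image_features(
    **clip_processor(images=image, return_tensors="pt"))

# OSBC combines signals from both branches for classification/retrieval;
# the exact fusion is defined by the code in this repo, not reproduced here.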

Setup

Environment

First, recreate the environment. We developed on a Linux machine, so users on other operating systems might need some workarounds.

To install all the dependencies:

conda env create -f environment.yml

and don't forget to download the spaCy English model:

python -m spacy download en_core_web_sm

Depending on your OS, PyTesseract (and the Tesseract engine it wraps) has to be installed in different ways. Refer to their repo for more information.
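For example, on Debian or Ubuntu (an assumption about your distribution), the underlying Tesseract engine can usually be installed with:

sudo apt-get install tesseract-ocr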

Data

We use the following datasets: Flickr8k, Standard OCR, MNIST, CIFAR-10, and a custom Dilbert dataset.

The MNIST and CIFAR-10 datasets are downloaded automatically the first time you run the pipeline, since we use the torchvision datasets.
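For reference, torchvision fetches these along the lines of the snippet below (the root path is an assumption; the pipeline's actual download location may differ):

from torchvision import datasets

# downloaded on first use, then reused from the local copy
mnist = datasets.MNIST(root="data", train=True, download=True)
cifar = datasets.CIFAR10(root="data", train=True, download=True)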

For all the other datasets, you have to download them manually into the data directory and reorganize them. The data directory should look something like this:

.
├── characters
│   ├── train
│   └── validation
├── dilbert
│   ├── captions.csv
│   └── Images
└── flickr8k
    ├── captions.csv
    └── Images

The Characters dataset is a subset of the Standard OCR dataset that only includes the English letter folders.
The Dilbert dataset is available upon request.

Evaluate

To evaluate one of the pipelines we use in the paper, run the following command:

python main.py {task name} {dataset name} {model flag} {optional model versions}

For example, to evaluate a custom OSBC model on characters:

python main.py classification characters --eval_osbc=True --clip_model="openai/clip-vit-base-patch32" --ocr_model="microsoft/trocr-base-printed"

The list of all possible parameters:

task: classification, retrieval 
dataset: characters, mnist, cifar, flickr8k, dilbert 
--eval_osbc: Bool (default: False)
--eval_clip: Bool (default: False)
--eval_os: Bool (default: False)
--clip_model: openai/clip-vit-base-patch16, openai/clip-vit-base-patch32, openai/clip-vit-large-patch14 (default: openai/clip-vit-base-patch32)
--ocr_model: microsoft/trocr-base-printed, microsoft/trocr-base-handwritten, psm 6, psm 10 (default: microsoft/trocr-base-printed; the psm options select PyTesseract page segmentation modes)

You must use a valid task and dataset combination, and choose at least one model flag.
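For example, to compare plain CLIP against OSBC on Flickr8k retrieval in a single run (a flag combination chosen purely for illustration):

python main.py retrieval flickr8k --eval_clip=True --eval_osbc=True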

Finetuning CLIP

To run finetuning, head over to the finetuner directory, where there is a config file to run the training with. This is a temporary adaptation of the VisionTextDualEncoder repo for our use case. We also tracked our experiments on wandb.

Once training is complete, the model is saved in the models/finetuned directory. You can use this CLIP model on the command line just like the previous ones, simply by specifying the directory it's in:

python main.py classification characters --eval_clip=True --clip_model="models/finetuned/{dir and name of the model}"

Contributors

This repo was created by Jordan Sassoon. For any questions, feel free to reach out.
