AdaCLIP: Towards Pragmatic Multimodal Video Retrieval

This repo is the implementation for the paper "AdaCLIP: Towards Pragmatic Multimodal Video Retrieval".

Incorporating large image-text foundation models such as CLIP has substantially improved the performance of the multimodal video retrieval task. However, how to practically sample the frames from a video and aggregate the frame features into a video representation is still an open research question. In particular, real-world deployment scenarios, such as embodiment within consumer electronics or cloud-based inference pipelines, require two key facets of retrieval (representation building and search) to be computationally light and fast. In this paper, we propose AdaCLIP, a computationand latency-aware system for pragmatic multimodal video retrieval. AdaCLIP consists of a learning-based frame selection module to select informative frames and a query-independent frame aggregation module to obtain strong video representations from the frame features. Specifically, in the frame selection module, we introduce a differentiable Hard-Top-k algorithm to sample a subset of the frames while optimizing the performance of the video retrieval task in an end-to-end manner. Moreover, to be latency-aware, we also propose a query-independent lightweight approach, MLP-Score, to aggregate the frame features into the video representation, which offers up to 142x speedup on GPU and 822x speedup on CPU in similarity search time compared to query-dependent matching methods. Experimental results on several popular video retrieval datasets confirm the effectiveness of AdaCLIP.

Maintainers

Angela Ye ([email protected])
Zhiming Hu ([email protected])

Prerequisites

Linux (Ubuntu 16.04 or later is recommended)
Python 3.7
Packages:
- ffmpeg ($sudo apt-get install ffmpeg)
Datasets: ActivityNet Dense Captions, MSRVTT, DiDeMo

How to Install

Create a conda environment and install the appropriate packages:

$ conda create -n adaclip python=3.7 -y
$ conda activate adaclip
$ conda install -y pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
$ pip install -r requirements.txt

Update cudatoolkit=11.6 with the appropriate CUDA version on your machine.

Datasets

MSRVTT

The videos are shared by Frozen in Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

DiDeMo

The videos can be downloaded from LisaAnne/LocalizingMoments.

ActivityNet

Download the videos from the official website. The authors have made the videos available on Google and Baidu drives.

Preprocessing

Frame Extraction

Run utils/frame_extraction.py after having downloaded the dataset videos and annotations from the website. Make sure that all the videos are in the same directory (no sub-directories allowed).

python utils/frame_extraction.py /path/to/videos /path/to/frames --parallel

Subsequently, update the frames_dir parameter in the config files configs/[dataset].json.

Annotation Preprocessing

If the videos downloaded differ from the set used in the paper, run annot_preprocess/{dataset}_preprocess.py to generate train/test splits used by the dataloader. Splits used in the paper can be found in annots/.

To obtain the annotation files used to generate the splits, please download them from the following links:

MSRVTT annotations are from CLIP4Clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

ActivityNet annotations are from the project page of ActivityNet Captions:

wget https://cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip

DiDeMo annotations have two components: annotations from the original author and the split used by Collaborative Experts.

Training

Train AdaCLIP on MSRVTT

python -m torch.distributed.launch --nproc_per_node=3 train.py --config configs/msrvtt-jsfusion.json --frames_dir /path/to/msrvtt/frames --top_k 12 --freeze_cnn --frame_agg mlp

Train AdaCLIP on DiDeMo

python -m torch.distributed.launch --nproc_per_node=3 train.py --config configs/didemo.json --frames_dir /path/to/didemo/frames --top_k 16 --num_epochs 5 --freeze_cnn --frame_agg mlp

Train AdaCLIP on ActivityNet

python -m torch.distributed.launch --nproc_per_node=3 train.py --config configs/activitynet.json --frames_dir /path/to/activitynet/frames --top_k 32 --freeze_cnn --frame_agg mlp

Train uniform sampling & mean aggregation baseline on MSRVTT

python -m torch.distributed.launch --nproc_per_node=3 train.py --config configs/msrvtt-jsfusion.json --frames_dir /path/to/msrvtt/frames --no_policy

You may need to update the batch size in the config files and/or the number of GPUs used by changing the --nproc_per_node=NUM_GPU parameter.

Inference

Inference on MSRVTT

python -m torch.distributed.launch --nproc_per_node=3 train.py --config configs/msrvtt-jsfusion.json --frames_dir /path/to/msrvtt/frames --top_k 12 --frame_agg mlp --do_inference --resume /path/to/msrvtt/checkpoint

Citation

If you use this code, please cite our paper:

@inproceedings{AdaCLIP,
    title={AdaCLIP: Towards Pragmatic Multimodal Video Retrieval},
    author={Hu, Zhiming and Ye, Angela N and Khorasgani, Salar H and Mohomed, Iqbal},
    booktitle={Proc. of the 31st ACM International Conference on Multimedia},
    year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
annot_preprocess		annot_preprocess
annots		annots
checkpoints		checkpoints
configs		configs
datasets		datasets
images		images
modeling		modeling
optimization		optimization
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AdaCLIP: Towards Pragmatic Multimodal Video Retrieval

Maintainers

Prerequisites

How to Install

Datasets

MSRVTT

DiDeMo

ActivityNet

Preprocessing

Frame Extraction

Annotation Preprocessing

Training

Inference

Citation

About

Releases

Packages

Languages

License

Arbor334/AdaCLIP

Folders and files

Latest commit

History

Repository files navigation

AdaCLIP: Towards Pragmatic Multimodal Video Retrieval

Maintainers

Prerequisites

How to Install

Datasets

MSRVTT

DiDeMo

ActivityNet

Preprocessing

Frame Extraction

Annotation Preprocessing

Training

Inference

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages