# BayesianVSLNet - Ego4D Step Grounding Challenge CVPR24 🏆

🔜: We will release checkpoints and pre-extracted video features.

[ArXiv] [Leaderboard]

## Challenge

The challenge is built on top of the Ego4D Goal-Step dataset and codebase.

Goal: Given an untrimmed egocentric video, identify the temporal action segment corresponding to a natural language description of the step. Specifically, predict the (start_time, end_time) for a given keystep description.
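
For intuition only, a prediction can be pictured as in the sketch below; the field names are illustrative assumptions, not the official Ego4D GoalStep submission schema.

```python
# Purely illustrative sketch of the task's input/output; field names are assumptions,
# not the official Ego4D GoalStep submission format.
query = {
    "video_uid": "example_video",  # hypothetical identifier of an untrimmed egocentric video
    "description": "pour the beaten eggs into the pan",  # hypothetical keystep description
}
prediction = {
    "start_time": 134.2,  # predicted segment start, in seconds
    "end_time": 151.8,    # predicted segment end, in seconds
}
```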


The leaderboard 🚀 reports test-set results for the best approaches. Our method currently ranks first 🚀🔥.

## BayesianVSLNet

We introduce our approach, BayesianVSLNet: Bayesian temporal-order priors for test-time refinement. Our model improves upon traditional models by incorporating a novel Bayesian temporal-order prior during inference, which accounts for cyclic and repetitive actions within the video and improves the accuracy of moment predictions. Please see the paper for further details.
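
As a rough intuition for the test-time refinement, the sketch below (our own minimal illustration, not the exact formulation from the paper) re-weights per-timestep localization scores with a Gaussian prior centered according to the relative order of the queried step, so early steps are pulled toward the start of the video and later steps toward the end.

```python
import numpy as np

def reweight_with_order_prior(scores, step_index, num_steps, sigma=0.15):
    """Re-weight per-timestep localization scores with a temporal-order prior.

    Illustrative only; not the paper's exact formulation.
    scores:     1D array of localization scores over the video timeline.
    step_index: 0-based position of the queried step within the procedure.
    num_steps:  total number of steps in the procedure.
    sigma:      prior width as a fraction of video length (assumed value).
    """
    t = np.linspace(0.0, 1.0, len(scores))       # normalized video time in [0, 1]
    center = (step_index + 0.5) / num_steps      # where this step is expected a priori
    prior = np.exp(-0.5 * ((t - center) / sigma) ** 2)
    posterior = scores * prior                   # combine likelihood-like scores with the prior
    return posterior / (posterior.sum() + 1e-8)  # renormalize to a distribution over time

# Example: 100 timesteps, querying the 3rd of 8 steps in the procedure.
refined = reweight_with_order_prior(np.random.rand(100), step_index=2, num_steps=8)
```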


## Install

```bash
git clone https://github.com/cplou99/BayesianVSLNet
cd BayesianVSLNet
pip install -r requirements.txt
```

## Video Features

We use both Omnivore-L and EgoVLPv2 video features. They should be pre-extracted and placed at `./ego4d-goalstep/step_grounding/data/features/`.
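
As a quick sanity check before training, something like the snippet below can verify the features are in place; the per-backbone subfolder names are assumptions, so adapt them to your actual layout.

```python
from pathlib import Path

# Root directory for pre-extracted video features (from this README).
features_root = Path("./ego4d-goalstep/step_grounding/data/features")

# Per-backbone subfolder names are assumptions for illustration; adjust to your layout.
for backbone in ("omnivore", "egovlpv2"):
    path = features_root / backbone
    status = "found" if path.is_dir() else "MISSING"
    print(f"{path}: {status}")
```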

## Model

The EgoVLPv2 weights, which are used to extract text features, must be placed in `BayesianVSLNet/NaQ/VSLNet_Bayesian/model/EgoVLP_weights`.
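
A small check like the one below can confirm the weights are in place before training; the exact checkpoint filename is not specified in this README, so the snippet only lists whatever it finds.

```python
from pathlib import Path

# Directory where the EgoVLPv2 weights are expected (from this README).
weights_dir = Path("BayesianVSLNet/NaQ/VSLNet_Bayesian/model/EgoVLP_weights")
print(f"{weights_dir}: {'found' if weights_dir.is_dir() else 'MISSING'}")

# The checkpoint filename is not specified here, so list whatever is present.
if weights_dir.is_dir():
    for ckpt in sorted(weights_dir.iterdir()):
        print("  -", ckpt.name)
```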

## Train

```bash
cd ego4d-goalstep/step_grounding/
bash train_Bayesian.sh experiments/
```

## Inference

```bash
cd ego4d-goalstep/step_grounding/
bash infer_Bayesian.sh experiments/
```