A benchmark and analysis for fine-grained visual comprehension (FGVC) tasks in large vision language models (LVLMs).

Finer: A Benchmark for Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models

Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [Paper]
Jeonghwan Kim, Heng Ji

Contents

  • Install
  • Dataset
  • Instruction-Tuning
  • Citation

Usage and License Notices: This dataset inherits the Usage and License Notices from LLaVA (Liu et al., 2023). The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.

Install

  1. Clone this repository to reproduce the Finer evaluation:

git clone https://github.com/wjdghks950/Finer.git
cd Finer

  2. Install and set up the llava conda environment from the official LLaVA repository to run LLaVA.

  3. Install and set up the lavis conda environment from the official LAVIS repository to run InstructBLIP and BLIP-2. (A combined setup sketch follows this list.)
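A condensed sketch of both environment setups, assuming the conda-based instructions in the two upstream READMEs at the time of writing; defer to those READMEs if the pinned Python versions or install steps have since changed:

# LLaVA environment (per the official LLaVA README)
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install -e .
cd ..

# LAVIS environment (per the official LAVIS README)
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
conda create -n lavis python=3.8 -y
conda activate lavis
pip install -e .
cd ..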

Dataset

  • First, create a separate data directory in the same directory as the LLaVA and LAVIS directories.
  • Download each dataset from the links below and arrange the files into the layout shown here, where ... indicates the downloaded dataset files, including images and their annotations (a one-liner for creating this skeleton follows the tree):
├── inaturalist
│   └── ...
├── fgvc-aircraft-2013b
│   └── ...
├── CUB_200_2011
│   └── ...
├── nabirds
│   └── ...
├── stanford_cars
│   └── ...
└── stanford_dogs
    └── ...
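If you are assembling the layout by hand, the empty skeleton can be created in one line (brace expansion requires bash or zsh); the downloaded archives are then extracted into the matching folders:

mkdir -p data/{inaturalist,fgvc-aircraft-2013b,CUB_200_2011,nabirds,stanford_cars,stanford_dogs}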
  • If you'd like to set the datasets up yourself, create the following directories under the data directory (an example download is shown after this list):
    • data/inaturalist - Download the evaluation dataset images / annotations and the training dataset images / annotations;
    • data/fgvc-aircraft-2013b - Download the dataset from the following link: dataset link
    • data/nabirds - Download the dataset from the following link: dataset link. You need to agree to the Terms of Use and get the downloadable link manually; follow the instructions in the nabirds dataset link.
    • data/CUB_200_2011 - Download the dataset from the following link: dataset link
    • data/stanford_dogs - Download the dataset from the following link: images / annotations / train/test_split
    • data/stanford_cars - Download the dataset from the following Kaggle link: dataset
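As a concrete example, fgvc-aircraft is distributed as a single archive; a download sketch, assuming the archive URL from the dataset's project page is still current:

cd data
wget https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/archives/fgvc-aircraft-2013b.tar.gz
tar -xzf fgvc-aircraft-2013b.tar.gz   # unpacks into fgvc-aircraft-2013b/
cd ..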
  • In each dataset directory (e.g., data/stanford_cars) there is a concept-to-attribute dictionary (parsed-{dataset_name}-{model_name}-wiki-text-combined.json) whose entries have the following format (a jq query sketch follows the example):
    {
        "id": 41,
        "name": "Acura ZDX Hatchback 2012",
        "attr_binomial": {
            "required": [
                "Five-door coupe-like hatchback body style",
                "Unique, sloping roofline that tapers towards the rear",
                "Shield-shaped front grille with the Acura logo",
                "Angular headlight design with integrated daytime running lights",
                "Distinctive, raised rear end with a high-mounted spoiler",
                "Dual exhaust outlets at the rear",
                "Sharp character lines along the sides"
            ],
            "likely": [
                "LED taillights",
                "19-inch alloy wheels",
                "Panoramic glass roof",
                "Chrome door handle accents",
                "Body-colored side mirrors with integrated turn signals",
                "Sculpted hood design"
            ]
        }
    }
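To sanity-check the attribute dictionary, the required attributes of a single concept can be pulled out with jq. A minimal sketch, assuming the parsed-*.json file is a JSON array of records like the one above, with {model_name} left as a placeholder for the model name in the actual filename on disk (adjust the filter if the top level is an object instead):

# List the required attributes for one concept; {model_name} is a placeholder
jq '.[] | select(.name == "Acura ZDX Hatchback 2012") | .attr_binomial.required' \
    "data/stanford_cars/parsed-stanford_cars-{model_name}-wiki-text-combined.json"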
  • For the superordinate (basic-level), coarse-level, and fine-level labels, each data/{dataset_name} folder contains a file named unified-{dataset_name}-{split}-combined.jsonl, with one record per line (see the sketch after the example):
{
    "idx": 278, 
    "basic-level-lbl": "Airplane", 
    "coarse-level-lbl": ["Boeing"], 
    "fine-level-lbl": ["Boeing 737", "737-800"], 
    "img_path": "{data_dir_path}/fgvc-aircraft-2013b/data/images/1935750.jpg", "metadata": {}}

Instruction-Tuning

Fine-tune the LLaVA-v1.5 (7B) model using the finer-mixture (LLaVA/playground/data/llava_v1_5_mix865k_attr_gen_fine_answer_inaturalist.json), which was built on top of the LLaVA instruction-tuning mixture, as follows:

cd LLaVA/scripts/v1_5
./finetune_lora.sh

Fine-tuning LLaVA-v1.5 (7B) on the finer-mixture took approximately 28 hours on 4 V100 GPUs (16 GB each).
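If the launch script follows the usual convention of training on every visible GPU, pinning it to four specific devices is just standard CUDA environment handling rather than a Finer-specific option:

CUDA_VISIBLE_DEVICES=0,1,2,3 ./finetune_lora.sh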

Citation

If you find Finer useful for your research and applications, please cite using this BibTeX:

@misc{kim2024finer,
      title={Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models},
      author={Kim, Jeonghwan and Ji, Heng},
      year={2024},
      eprint={2402.16315},
      archivePrefix={arXiv}
}
