A benchmark and analysis for fine-grained visual comprehension (FGVC) tasks in large vision language models (LVLMs).

Finer: A Benchmark for Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models

Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [Paper]
Jeonghwan Kim, Heng Ji

Contents

  • Install
  • Dataset
  • Instruction-Tuning
  • Citation

Usage and License Notices: This dataset inherits the Usage and License Notices from LLaVA (Liu et al., 2023). The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.

Install

  1. Clone this repository to reproduce the Finer evaluation:

git clone https://github.com/wjdghks950/Finer.git
cd Finer

  2. Install and set up the llava conda environment from the official LLaVA repository to run LLaVA.

  3. Install and set up the lavis conda environment from the official LAVIS repository to run InstructBLIP and BLIP-2. (A combined setup sketch follows this list.)
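A condensed sketch of both environment setups, assuming the conda-based instructions in the two upstream READMEs at the time of writing; defer to those READMEs if the pinned Python versions or install steps have since changed:

# LLaVA environment (per the official LLaVA README)
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install -e .
cd ..

# LAVIS environment (per the official LAVIS README)
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
conda create -n lavis python=3.8 -y
conda activate lavis
pip install -e .
cd ..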

Dataset

  • First, create a separate data directory in the same directory as the LLaVA and LAVIS directories.
  • Download each dataset from the links below and arrange the files into the layout shown here, where ... indicates the downloaded dataset files, including images and their annotations (a one-liner for creating this skeleton follows the tree):
├── inaturalist
│   └── ...
├── fgvc-aircraft-2013b
│   └── ...
├── CUB_200_2011
│   └── ...
├── nabirds
│   └── ...
├── stanford_cars
│   └── ...
└── stanford_dogs
    └── ...
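If you are assembling the layout by hand, the empty skeleton can be created in one line (brace expansion requires bash or zsh); the downloaded archives are then extracted into the matching folders:

mkdir -p data/{inaturalist,fgvc-aircraft-2013b,CUB_200_2011,nabirds,stanford_cars,stanford_dogs}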
  • If you'd like to set the datasets up yourself, create the following directories under the data directory (an example download is shown after this list):
    • data/inaturalist - Download the evaluation dataset images / annotations and the training dataset images / annotations;
    • data/fgvc-aircraft-2013b - Download the dataset from the following link: dataset link
    • data/nabirds - Download the dataset from the following link: dataset link. You need to agree to the Terms of Use and get the downloadable link manually; follow the instructions in the nabirds dataset link.
    • data/CUB_200_2011 - Download the dataset from the following link: dataset link
    • data/stanford_dogs - Download the dataset from the following link: images / annotations / train/test_split
    • data/stanford_cars - Download the dataset from the following Kaggle link: dataset
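As a concrete example, fgvc-aircraft is distributed as a single archive; a download sketch, assuming the archive URL from the dataset's project page is still current:

cd data
wget https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/archives/fgvc-aircraft-2013b.tar.gz
tar -xzf fgvc-aircraft-2013b.tar.gz   # unpacks into fgvc-aircraft-2013b/
cd ..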
  • In each dataset directory (e.g., data/stanford_cars) there is a concept-to-attribute dictionary (parsed-{dataset_name}-{model_name}-wiki-text-combined.json) whose entries have the following format (a jq query sketch follows the example):
    {
        "id": 41,
        "name": "Acura ZDX Hatchback 2012",
        "attr_binomial": {
            "required": [
                "Five-door coupe-like hatchback body style",
                "Unique, sloping roofline that tapers towards the rear",
                "Shield-shaped front grille with the Acura logo",
                "Angular headlight design with integrated daytime running lights",
                "Distinctive, raised rear end with a high-mounted spoiler",
                "Dual exhaust outlets at the rear",
                "Sharp character lines along the sides"
            ],
            "likely": [
                "LED taillights",
                "19-inch alloy wheels",
                "Panoramic glass roof",
                "Chrome door handle accents",
                "Body-colored side mirrors with integrated turn signals",
                "Sculpted hood design"
            ]
        }
    }
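To sanity-check the attribute dictionary, the required attributes of a single concept can be pulled out with jq. A minimal sketch, assuming the parsed-*.json file is a JSON array of records like the one above, with {model_name} left as a placeholder for the model name in the actual filename on disk (adjust the filter if the top level is an object instead):

# List the required attributes for one concept; {model_name} is a placeholder
jq '.[] | select(.name == "Acura ZDX Hatchback 2012") | .attr_binomial.required' \
    "data/stanford_cars/parsed-stanford_cars-{model_name}-wiki-text-combined.json"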
  • For the superordinate (basic-level), coarse-level, and fine-level labels, each data/{dataset_name} folder contains a file named unified-{dataset_name}-{split}-combined.jsonl, with one record per line (see the sketch after the example):
{
    "idx": 278, 
    "basic-level-lbl": "Airplane", 
    "coarse-level-lbl": ["Boeing"], 
    "fine-level-lbl": ["Boeing 737", "737-800"], 
    "img_path": "{data_dir_path}/fgvc-aircraft-2013b/data/images/1935750.jpg", "metadata": {}}

Instruction-Tuning

Fine-tune the LLaVA-v1.5 (7B) model using the finer-mixture (LLaVA/playground/data/llava_v1_5_mix865k_attr_gen_fine_answer_inaturalist.json), which was built on top of the LLaVA instruction-tuning mixture, as follows:

cd LLaVA/scripts/v1_5
./finetune_lora.sh

Fine-tuning LLaVA-v1.5 (7B) on the finer-mixture took approximately 28 hours on 4 V100 GPUs (16 GB each).
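If the launch script follows the usual convention of training on every visible GPU, pinning it to four specific devices is just standard CUDA environment handling rather than a Finer-specific option:

CUDA_VISIBLE_DEVICES=0,1,2,3 ./finetune_lora.sh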

Citation

If you find Finer useful for your research and applications, please cite using this BibTeX:

@misc{kim2024finer,
      title={Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models},
      author={Kim, Jeonghwan and Ji, Heng},
      year={2024},
      eprint={2402.16315},
      archivePrefix={arXiv}
}
