🎨Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

Jie Zhang*, Zhongqi Wang, Mengqi Lei, Zheng Yuan, Bei Yan, Shiguang Shan, Xilin Chen

*Corresponding Author

Dysca is a benchmark that uses Stable Diffusion and a rule-based method to dynamically generate novel images, questions, and their corresponding answers.

The full data will be uploaded soon.

🔥 News

[2025/1/22] Our work has been accepted to ICLR 2025!

🔍 Overview

Figure 1. Overview of the automatic pipeline in Dysca for generating VQAs, cleaning VQAs and evaluating LVLMs.

Figure 2. The available subtasks of our Dysca.

Abstract - Many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most of them construct questions by selecting images from existing datasets, which risks data leakage. Moreover, these benchmarks focus on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. To address these challenges, we propose Dysca, a dynamic and scalable benchmark that evaluates LVLMs using synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions, and the corresponding answers. We consider 51 image styles and evaluate perception capability across 20 subtasks. We conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking, and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false, and Free-form). Thanks to its generative paradigm, Dysca is a scalable benchmark to which new subtasks and scenarios can easily be added. A total of 24 advanced open-source LVLMs and 2 closed-source LVLMs are evaluated on Dysca, revealing the drawbacks of current LVLMs.
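The rule-based generation described in the abstract can be illustrated with a minimal sketch. This is not the Dysca implementation: the metadata pools (`STYLES`, `ANIMALS`), the `Sample` type, the `make_sample` function, and the question templates are all hypothetical names chosen for illustration, and using the prompt as the free-form ground truth is an assumption. The idea is only that, because the prompt is built from sampled metadata, the question and its answer can be derived by rule without manual annotation.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical metadata pools; the real Dysca vocabularies are not listed in this README.
STYLES = ["oil painting", "pixel art", "watercolor"]
ANIMALS = ["cat", "dog", "horse", "rabbit"]


@dataclass
class Sample:
    prompt: str         # text that would be passed to a text-to-image model
    question: str
    options: List[str]  # empty for free-form questions
    answer: str


def make_sample(question_type: str, rng: random.Random) -> Sample:
    """Build one prompt/question/answer triple from sampled metadata."""
    style = rng.choice(STYLES)
    animal = rng.choice(ANIMALS)
    prompt = f"a {animal}, {style} style"

    if question_type == "multi-choice":
        # The correct option is known from the metadata; the rest are distractors.
        distractors = rng.sample([a for a in ANIMALS if a != animal], 3)
        options = distractors + [animal]
        rng.shuffle(options)
        return Sample(prompt, "Which animal is shown in the image?", options, animal)

    if question_type == "true-or-false":
        # The claim is true only when it matches the sampled metadata.
        claimed = rng.choice(ANIMALS)
        answer = "true" if claimed == animal else "false"
        question = f"The image shows a {claimed}. True or false?"
        return Sample(prompt, question, ["true", "false"], answer)

    if question_type == "free-form":
        # Assumption: the generation prompt serves as the reference description.
        return Sample(prompt, "Describe the image in one sentence.", [], prompt)

    raise ValueError(f"unknown question type: {question_type}")


if __name__ == "__main__":
    rng = random.Random(0)
    for qtype in ("multi-choice", "true-or-false", "free-form"):
        print(make_sample(qtype, rng))
```

In a full pipeline, `prompt` would be sent to an image generator such as Stable Diffusion, and the resulting image would be paired with `question` and `answer`. Seeding the random generator keeps a generated set reproducible, while changing the seed yields a fresh set.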

📊 Comparison with Existing Benchmarks

Comparison with existing LVLM benchmarks. '⍻' indicates that a benchmark includes both newly collected images/annotations and images/annotations gathered from existing datasets. '*' The scale of our released benchmark is 617K; however, Dysca can generate an unlimited amount of test data.

Benchmark	#Evaluation Data Scale	#Perceptual Tasks	Automatic Annotation	Collecting from Existing Datasets	Question Type	Automatic Evaluation
LLaVA-Bench	0.15K	-	×	⍻	Free-form	√
MME	2.3K	10	×	⍻	True-or-false	√
LVLM-eHub	-	3	√	×	Free-form	×
tiny-LVLM-eHub	2.1K	3	√	×	Free-form	√
SEED-Bench	19K	8	⍻	×	Multi-choices	√
MMBench	2.9K	12	×	⍻	Multi-choices	√
TouchStone	0.9K	10	×	√	Free-form	√
REFORM-EVAL	50K	7	√	×	Multi-choices	√
MM-BigBench	30K	6	√	×	Multi-choices	√
MM-VET	0.2K	4	⍻	⍻	Free-form	√
MLLM-Bench	0.42K	7	×	⍻	Free-form	√
SEED-Bench2	24K	10	⍻	×	Multi-choices	√
BenchLMM	2.4K	15	×	×	Free-form	√
JourneyDB	5.4K	2	√	√	Free-form, Multi-choices	√
Dysca (Ours)	617K*	20	√	√	Free-form, Multi-choices, True-or-false	√

📸 Examples of Dysca

Here are some examples of the images, prompts, questions and ground truth answers of our Dysca. These images are generated by diffusion models.

🔗 Related projects

📄 Citation

If you find this project useful in your research, please consider citing:

@misc{zhang2024dyscadynamicscalablebenchmark,
      title={Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs}, 
      author={Jie Zhang and Zhongqi Wang and Mengqi Lei and Zheng Yuan and Bei Yan and Shiguang Shan and Xilin Chen},
      year={2024},
      eprint={2406.18849},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.18849}, 
}

🤝 Feel free to discuss with us privately!

