| 🤗 Hugging Face | 📄 Paper | 🌐 Blog |
MediConfusion is a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical Multimodal Large Language Models (MLLMs) from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct for medical experts. These are some examples of confusing image pairs from the ROCO radiology dataset:
Our benchmark consists of 176 confusing pairs. A confusing pair is a set of two images that share the same question and corresponding answer options, but the correct answer differs between the two images.
We evaluate models based on their ability to answer the shared question correctly for both images in a confusing pair, which we call set accuracy. This metric indicates how well models can tell the two images apart: a model that selects the same answer option for both images in every pair will receive 0% set accuracy. We also report confusion, the proportion of confusing pairs for which the model chooses the same answer option for both images.
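To make the two metrics concrete, here is a minimal, self-contained sketch of how they can be computed from per-pair answers. The `PairResult` layout is hypothetical and is not the repository's actual data structure; it only illustrates the definitions above.

```python
# Minimal sketch of set accuracy and confusion (hypothetical data layout, not the repo's API).
from dataclasses import dataclass

@dataclass
class PairResult:
    chosen_a: str   # model's answer for image A
    chosen_b: str   # model's answer for image B
    correct_a: str  # ground-truth option for image A
    correct_b: str  # ground-truth option for image B

def set_accuracy(pairs):
    # A pair counts only if BOTH images are answered correctly.
    both_right = sum(p.chosen_a == p.correct_a and p.chosen_b == p.correct_b for p in pairs)
    return 100.0 * both_right / len(pairs)

def confusion(pairs):
    # Fraction of pairs where the model picks the same option for both images.
    same_answer = sum(p.chosen_a == p.chosen_b for p in pairs)
    return 100.0 * same_answer / len(pairs)

# A model that always gives the same answer for both images gets 0% set accuracy, 100% confusion.
pairs = [PairResult("A", "A", "A", "B"), PairResult("B", "B", "A", "B")]
print(set_accuracy(pairs), confusion(pairs))  # 0.0 100.0
```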
Strikingly, all available models (open-source or proprietary) achieve performance below random guessing on MediConfusion, raising serious concerns about the reliability of existing medical MLLMs for healthcare deployment.
Rank | Model | Version | Set acc. (%) | Confusion (%) |
---|---|---|---|---|
🏅️ | Random Guessing | - | 25.00 | 50.00 |
🥈 | Gemini | 1.5 Pro | 19.89 | 58.52 |
🥉 | GPT | 4o (release 20240513) | 18.75 | 75.00 |
4 | Llama 3.2 | 90B-Vision-Instruct | 15.34 | 78.41 |
5 | InstructBLIP | Vicuna 7B | 12.50 | 80.35 |
6 | Molmo | 7B-D-0924 | 9.66 | 86.21 |
7 | LLaVA | v1.6-Mistral 7B | 9.09 | 85.80 |
8 | Claude | 3 Opus | 8.52 | 84.09 |
9 | BLIP-2 | Opt 2.7B | 6.82 | 86.93 |
10 | Molmo | 72B-0924 | 6.82 | 85.80 |
11 | RadFM | - | 5.68 | 85.80 |
12 | Med-Flamingo | - | 4.55 | 98.30 |
13 | LLaVA-Med | v1.5-Mistral 7B | 1.14 | 97.16 |
- [2024/09/11] Molmo family added to the supported models.
- [2024/03/11] Llama 3.2 family added to the supported models.
Create and activate a conda environment with the following commands:
conda create -n "mediconfusion" python=3.10
conda activate mediconfusion
Use the following command to install the requirements:
pip install -r requirements.txt
If you have any problems using the models, please follow the instructions below.
The images in MediConfusion have to be downloaded directly from the source due to their license. To download all images (26 MB), use the following command:
python scripts/download.py
The images can also be downloaded directly from ROCO (set `local_image_address` to `False`). In this case, set `data_path` to the download folder when running the evaluation script (more details in Usage).
- LLaVA-Med: Follow the instructions here and install LLaVA-Med. Download the model from here and set `model_path` in the config to its folder.
- LLaMA 3.2: To download this model, you need to request access here. Then, add your token to the config. If you encounter a CUDA memory error, set `device` to `auto`.
- Molmo: If you encounter a CUDA memory error, set `device` to `auto`.
- LLaMA: Download the model from here and set `LLaMa_PATH` in the `MedFlamingo` config to its folder.
- MedFlamingo: Download the model from here and set `CHECKPOINT_PATH` in the config to its folder.
- RadFM: Download the model from here and set `model_path` in the config to its folder.
To use proprietary models, save your API keys in a file named `.env` in the root directory of the repo, including the keys as in the example below.
GEMINI_API_KEY=YOUR_KEY
AZURE_OPENAI_API_KEY=YOUR_KEY
AZURE_OPENAI_ENDPOINT=YOUR_KEY
ANTHROPIC_API_KEY=YOUR_KEY
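If you want to verify that the keys are picked up before running an evaluation, here is a minimal sketch, assuming the `python-dotenv` package is installed; it is only an illustration and not part of the repository's scripts.

```python
# Sanity check that the keys in .env are visible to Python.
# Assumes the python-dotenv package; illustration only, not part of the repo's scripts.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("GEMINI_API_KEY", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "ANTHROPIC_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```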
Different MLLMs need different versions of the `transformers` package. Please use the following versions for each MLLM.
- LLaVA-Med: Use `transformers==4.36.2`
- RadFM: Use `transformers==4.28.1`
- MedFlamingo: Use `transformers==4.44.2` and install the `open-flamingo` package
- Gemini: You need `python>=3.9`
- Other MLLMs: Use `transformers==4.44.2` and `python>=3.8`
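For example, to prepare an environment for MedFlamingo with the versions listed above, you would run something like the following (assuming `open-flamingo` is the PyPI package name):

pip install transformers==4.44.2 open-flamingo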
Before using the code, make sure to follow the instructions in Requirements.
You can create/change model configurations in `configs/MODEL_NAME/`.
To run the evaluation code, use the following command:
python scripts/answering.py --mllm_name MODEL_NAME --mode MODE
The results will be saved in `Results/MODEL_NAME/`. You will see two files: one containing the final scores and one containing the answers provided by the model.
After running `answering.py`, you can print the results again with the command below:
python scripts/printing.py --mllm_name MODEL_NAME --mode MODE
- `mode`: This sets the evaluation method. Available options are `gpt4` (FF), `mc` (MC), `greedy` (GD), and `prefix` (PS). For proprietary models, you can only use the first two methods.
- `mllm_name`: This is the name of your desired MLLM. Available options are `gpt` (GPT-4o), `gemini` (Gemini 1.5 Pro), `claude` (Claude 3 Opus), `llava` (LLaVA), `blip2` (BLIP-2), `intructblip` (InstructBLIP), `llava_med` (LLaVA-Med), `radfm` (RadFM), and `med_flamingo` (Med-Flamingo).
- `model_args_path` (default: `configs/MLLM_NAME/vanilla.json`): Path to the model's configuration file.
- `tr` (default: 3): Threshold used for FF evaluation to select an option. If the difference between the assigned scores is at least `tr`, we select the option with the higher score.
- `resume_path` (default: `None`): If your run is interrupted and you want to resume evaluation, set this argument to the path to the answers of the previous run.
- `local_image_address` (default: `True`): If `False`, the code looks for the images based on their ROCO IDs. Otherwise, it looks for the images based on their local IDs.
- `data_path` (default: `./data/images`): Path to the images. If you downloaded the images using our script, this is `./data/images`. If you are not using local addressing, this is the path to the ROCO download folder.
- `device` (default: `cuda`): You can use `cuda` or `cpu`. For LLaVA-Med, our code does not support `cpu`.
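Putting these arguments together, and assuming all of the options above are exposed as command-line flags with the same names (only `--mllm_name` and `--mode` appear in the command template above), a typical invocation for LLaVA with multiple-choice evaluation might look like:

python scripts/answering.py --mllm_name llava --mode mc --data_path ./data/images --device cuda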
If you use this code or our dataset, please cite our paper.
@article{sepehri2024mediconfusion,
title={MediConfusion: Can You Trust Your AI Radiologist? Probing the reliability of multimodal medical foundation models},
author={Sepehri, Mohammad Shahab and Fabian, Zalan and Soltanolkotabi, Maryam and Soltanolkotabi, Mahdi},
journal={arXiv preprint arXiv:2409.15477},
year={2024}
}