
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi, Mahdi Soltanolkotabi

| 🤗 Hugging Face | 📄 Paper | 🌐 Blog |


MediConfusion is a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical Multimodal Large Language Models (MLLMs) from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct to medical experts. Here are some examples of confusing image pairs from the ROCO radiology dataset:

[Figure: examples of confusing image pairs from the ROCO radiology dataset]

Our benchmark consists of 176 confusing pairs. A confusing pair is a set of two images that share the same question and answer options, but the correct answer differs between the two images.


We evaluate models based on their ability to answer the question correctly for both images within a confusing pair, which we call set accuracy. This metric indicates how well models can tell the two images apart: a model that selects the same answer option for both images in every pair receives 0% set accuracy. We also report confusion, the proportion of confusing pairs in which the model chooses the same answer option for both images.
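
To make the two metrics concrete, here is a minimal sketch of how they can be computed from per-image answers (the data layout is hypothetical; this is not the repository's evaluation code):

# Hypothetical data layout: one tuple per confusing pair, holding the model's
# answer for each image and the correct option for each image.
pairs = [
    ("A", "B", "A", "B"),  # both images answered correctly -> counts toward set accuracy
    ("A", "A", "A", "B"),  # same option chosen for both images -> counts toward confusion
]

set_correct = sum(a1 == c1 and a2 == c2 for a1, a2, c1, c2 in pairs)  # pairs answered fully correctly
confused = sum(a1 == a2 for a1, a2, _, _ in pairs)                    # pairs with identical answers

print(f"Set accuracy: {100 * set_correct / len(pairs):.2f}%")
print(f"Confusion:    {100 * confused / len(pairs):.2f}%")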

Strikingly, all available models (open-source or proprietary) achieve performance below random guessing on MediConfusion, raising serious concerns about the reliability of existing medical MLLMs for healthcare deployment.

📊 Leaderboard

| Rank | Model | Version | Set acc. (%) | Confusion (%) |
|------|-------|---------|--------------|---------------|
| 🏅️ | Random Guessing | - | 25.00 | 50.00 |
| 🥈 | Gemini | 1.5 Pro | 19.89 | 58.52 |
| 🥉 | GPT 4o | release 20240513 | 18.75 | 75.00 |
| 4 | Llama 3.2 | 90B-Vision-Instruct | 15.34 | 78.41 |
| 5 | InstructBLIP | Vicuna 7B | 12.50 | 80.35 |
| 6 | Molmo | 7B-D-0924 | 9.66 | 86.21 |
| 7 | LLaVA | v1.6-Mistral 7B | 9.09 | 85.80 |
| 8 | Claude | 3 Opus | 8.52 | 84.09 |
| 9 | BLIP-2 | Opt 2.7B | 6.82 | 86.93 |
| 10 | Molmo | 72B-0924 | 6.82 | 85.80 |
| 11 | RadFM | - | 5.68 | 85.80 |
| 12 | Med-Flamingo | - | 4.55 | 98.30 |
| 13 | LLaVA-Med | v1.5-Mistral 7B | 1.14 | 97.16 |

Updates

  • [2024/09/11] Molmo family added to the supported models.
  • [2024/03/11] Llama 3.2 family added to the supported models.

📖 Table of Contents

  • 🔧 Requirements
  • 🔰 Usage
  • 📌 Citation

🔧 Requirements

Create and activate a conda environment with the following command:

conda create -n "mediconfusion" python=3.10
conda activate mediconfusion

Use the following command to install the requirements:

pip install -r requirements.txt

If you have any problems running the models, please follow the instructions below.

Data Download

The images in MediConfusion have to be downloaded directly from the source due to their license. To download all images (26 MB), use the following command:

python scripts/download.py

The images can also be downloaded directly from ROCO (set local_image_address to False). In this case, set data_path to the download folder when running the evaluation script (more details in Usage).
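
For example, assuming you downloaded ROCO yourself, an evaluation run could be launched along these lines (the exact flag syntax is defined by scripts/answering.py, so treat this as a sketch):

python scripts/answering.py --mllm_name MODEL_NAME --mode MODE --local_image_address False --data_path /path/to/ROCO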

Open-source Models

  • LLaVA-Med: Follow the instructions here and install LLaVA-Med. Download the model from here and set model_path in the config to its folder (see the config sketch after this list).
  • LLaMA 3.2: To download this model, you must first request access here. Then, add your access token to the config. If you encounter a CUDA out-of-memory error, set device to auto.
  • Molmo: If you encounter a CUDA out-of-memory error, set device to auto.
  • LLaMA: Download the model from here and set LLaMa_PATH in the MedFlamingo config to its folder.
  • MedFlamingo: Download the model from here and set CHECKPOINT_PATH in the config to its folder.
  • RadFM: Download the model from here and set model_path in the config to its folder.
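
For the entries above that point a config field at a downloaded folder, the file to edit lives under configs/MODEL_NAME/ (e.g. vanilla.json). A hypothetical sketch for RadFM; only model_path and device are named in this README, and the actual file may contain additional settings:

{
    "model_path": "/path/to/RadFM",
    "device": "cuda"
}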

Proprietary Models

To use proprietary models, save your API keys in a file named .env in the root directory of the repo, including the keys as in the example below.

GEMINI_API_KEY=YOUR_KEY
AZURE_OPENAI_API_KEY=YOUR_KEY
AZURE_OPENAI_ENDPOINT=YOUR_ENDPOINT
ANTHROPIC_API_KEY=YOUR_KEY
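
These variables are read from the environment at runtime. As a minimal sketch, one common way to load such a file is via python-dotenv (an assumption here; the repository's actual loading code may differ):

# Sketch only: assumes the python-dotenv package is installed.
import os
from dotenv import load_dotenv

load_dotenv()                              # reads .env from the current working directory
gemini_key = os.environ["GEMINI_API_KEY"]  # raises KeyError if the variable is missing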

Package Versions

Different MLLMs need different versions of the transformers package. Please use the following versions for each MLLM (example install commands follow the list).

  • LLaVA-Med: Use transformers==4.36.2
  • RadFM: Use transformers==4.28.1
  • MedFlamingo: Use transformers==4.44.2 and install the open-flamingo package
  • Gemini: You need python>=3.9
  • Other MLLMs: Use transformers==4.44.2 and python>=3.8
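
For example, to switch the environment before evaluating a given model (treating open-flamingo as the PyPI package name is an assumption):

pip install transformers==4.36.2                # before running LLaVA-Med
pip install transformers==4.28.1                # before running RadFM
pip install transformers==4.44.2 open-flamingo  # before running MedFlamingo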

🔰 Usage

Evaluation

Before using the code, make sure to follow the instructions in Requirements.
You can create/change model configurations in configs/MODEL_NAME/.
To run the evaluation, use the following command:

python scripts/answering.py --mllm_name MODEL_NAME --mode MODE

The results will be saved in Results/MODEL_NAME/. You will see two files: one containing the final scores and one containing the answers provided by the model.
After running answering.py, you can print the results again with the command below:

python scripts/printing.py --mllm_name MODEL_NAME --mode MODE

Arguments

  • mode: This sets the evaluation method. Available options are gpt4 (FF), mc (MC), greedy (GD), and prefix (PS). For proprietary models, you can only use the first two methods.
  • mllm_name: This is the name of your desired MLLM. Available options are gpt (GPT-4o), gemini (Gemini 1.5 Pro), claude (Claude 3 Opus), llava (LLaVA), blip2 (BLIP-2), instructblip (InstructBLIP), llava_med (LLaVA-Med), radfm (RadFM), and med_flamingo (Med-Flamingo).
  • model_args_path (default: configs/MLLM_NAME/vanilla.json): Path to the model's configuration file.
  • tr (default: 3): Threshold used for FF evaluation to select an option. If the difference between assigned scores is at least tr, we select the option with the higher score.
  • resume_path (default: None): If your run is interrupted and you want to resume evaluation, you should set this argument to the path to the answers of the previous run.
  • local_image_address (default: True): If False, the code looks for the images based on their ROCO IDs. Otherwise, it looks for them based on their local IDs.
  • data_path (default: ./data/images): Path to the images. If you downloaded the images using our script, this is ./data/images. If you are not using local addressing, this is the path to the ROCO dataset.
  • device (default: cuda): You can use cuda or cpu. For LLaVA-Med, our code does not support cpu.
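
Putting the arguments together, a typical open-source run might look like the following (the model name and paths are illustrative):

python scripts/answering.py --mllm_name llava --mode mc --model_args_path configs/llava/vanilla.json --data_path ./data/images --device cuda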

📌 Citation

If you use this code or our dataset, please cite our paper.

@article{sepehri2024mediconfusion,
  title={MediConfusion: Can You Trust Your AI Radiologist? Probing the reliability of multimodal medical foundation models},
  author={Sepehri, Mohammad Shahab and Fabian, Zalan and Soltanolkotabi, Maryam and Soltanolkotabi, Mahdi},
  journal={arXiv preprint arXiv:2409.15477},
  year={2024}
}
