
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, and Salman Khan

Mohamed bin Zayed University of AI (MBZUAI)

Paper | Website | Model Weights


🔥 Updates

  • (Feb 03, 2025)
    • Adversarial evaluation codes are released.
    • Robust-LLaVA-H and Robust-LLaVA-G released: Excited to release the new integration of LLaVA with large-scale robust image encoders, ViT-H and ViT-G, respectively. 🔥🔥

Robust-LLaVA Diagram

Robustness score of Robust-LLaVA4 on downstream vision-language tasks with adversarial examples crafted at ε = 4/255: the original CLIP encoder integrated into LLaVA exhibits minimal robustness. Our proposed Robust-LLaVA4 outperforms state-of-the-art robust CLIP models such as FARE4 and Sim-CLIP4 in robustness score across all tasks and diverse datasets, while maintaining high clean accuracy. (Accuracy is reported for VQAv2 and TextVQA, while the CIDEr score is reported for Flickr30k and COCO.)

Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language tasks, but their reliance on visual processing introduces critical security vulnerabilities. Their vision encoders remain susceptible to adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms while maintaining coherent language generation. Current approaches attempt to address this by adversarially fine-tuning CLIP vision encoders on ImageNet-scale data, but exhibit inherent limitations in both robustness and generalization due to the restricted scale and diversity of adversarial training. In this work, we present an alternative approach by leveraging vision encoders adversarially pre-trained on billion-scale image-text pairs. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these encoders to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM optimization with these robust encoders facilitates enhanced adaptation of language components to robust visual features, substantially outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust encoders achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2× and 1.5× average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against advanced jailbreaking attacks compared to state-of-the-art methods.


Robust-LLaVA Diagram

Current multi-modal large language models (MLLMs) struggle to achieve high adversarial robustness while maintaining strong vision-language reasoning. Methods such as TeCoA, FARE, and SimCLIP perform constrained adversarial fine-tuning of CLIP to preserve the generalization capabilities of the pre-trained model. However, this limited adversarial training results in only modest robustness gains when the model is integrated into an MLLM framework. Moreover, the misalignment between adversarial CLIP training objectives and MLLMs' generative understanding creates a semantic alignment gap, impairing MLLMs' ability to perform complex visual reasoning. This leads us to explore whether current large-scale adversarially pre-trained vision encoders, which contain rich robust representations, can exhibit strong semantic alignment within the MLLM framework.

Left: We investigate the multimodal alignment of robust encoders by mapping their feature space, through a single linear layer, onto that of the pre-trained CLIP model, which has strong multimodal feature representations. The aligned robust encoders are then paired with CLIP's text encoder and evaluated on robust zero-shot classification, which serves as a measure of their robust multimodal alignment.

Right: The results demonstrate a strong correlation between model scale, training strategy, and robustness preservation during CLIP alignment. Small-scale models (e.g., ViT-B and ResNet-101) suffer significant robustness degradation after alignment, with accuracy dropping below 60% across all datasets. In contrast, large-scale models (ViT-H and ViT-G) retain their robustness while acquiring robust zero-shot capabilities. Leveraging this insight, we integrate these robust encoders into the LLaVA framework, achieving strong adversarial robustness and semantic alignment in MLLMs without additional specialized adversarial training.
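The alignment probe described above can be sketched in a few lines. The following is a minimal illustration rather than the actual training script: it assumes an OpenAI ViT-L/14 CLIP model loaded via open_clip as the alignment target, a generic robust_encoder callable that maps images to feature vectors (e.g., an AdvXL backbone), and illustrative feature dimensions. Only the linear layer is trained; both encoders stay frozen, and zero-shot evaluation pairs the projected robust features with CLIP's text embeddings.

```python
import torch
import torch.nn as nn
import open_clip

# Frozen CLIP model used as the alignment target and for zero-shot evaluation.
# (Assumption: OpenAI ViT-L/14 via open_clip; the exact target may differ.)
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
clip_model.eval().requires_grad_(False)

feat_dim, clip_dim = 1280, 768                 # illustrative dimensions
align = nn.Linear(feat_dim, clip_dim)          # the only trainable component
optimizer = torch.optim.AdamW(align.parameters(), lr=1e-4)

def alignment_step(robust_encoder, images):
    """Pull linearly projected robust features towards frozen CLIP image features."""
    with torch.no_grad():
        target = clip_model.encode_image(images)   # CLIP feature space
        source = robust_encoder(images)            # robust encoder feature space
    loss = nn.functional.mse_loss(align(source), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def zero_shot_logits(robust_encoder, images, class_prompts):
    """Zero-shot scores: projected robust image features vs. CLIP text features."""
    img = nn.functional.normalize(align(robust_encoder(images)), dim=-1)
    txt = nn.functional.normalize(
        clip_model.encode_text(tokenizer(class_prompts)), dim=-1)
    return img @ txt.T
```

Robust zero-shot accuracy is then obtained by running zero_shot_logits on adversarially perturbed images and checking whether the correct class prompt still scores highest.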


Installation 💿

You can follow the installation instructions in the LLaVA codebase, or follow the steps below:

  1. Clone the repository:
git clone https://github.com/HashmatShadab/Robust-LLaVA 
cd Robust-LLaVA
  2. Install the required dependencies:
conda create -n robust_llava python=3.10 -y
conda activate robust_llava
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia 
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

pip install open-clip-torch==2.19.0
pip install pycocoevalcap==1.2
pip install inflection==0.5.1
pip install torchattacks
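
After installation, an optional sanity check confirms that the core dependencies import cleanly and that a CUDA device is visible:

```python
# Optional sanity check for the environment created above.
import torch
import open_clip      # noqa: F401
import torchattacks   # noqa: F401

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```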

Model Zoo 🧠

| Model | Stage 1: Feature Alignment | Stage 2: Instruction Tuning |
|---|---|---|
| CLIP | Link | Link |
| Robust-LLaVA4H | Link | Link |
| Robust-LLaVA4G | Link | Link |
| Robust-LLaVA4H + CLIP | Link | Link |
| Robust-LLaVA4G + CLIP | Link | Link |
| Robust-LLaVA4H + Robust-LLaVA4G | Link | Link |
| ViT-B/16 (Adversarially trained on ImageNet-1k) | Link | Link |
| ViT-L/14 (Naturally trained on ImageNet-21k + ImageNet-1k) | Link | Link |

🔗 All checkpoints for Stage 1 (Feature Alignment) and Stage 2 (Instruction Tuning) are available at:

➡️ Stage 1 Checkpoints
➡️ Stage 2 Checkpoints

Previous works, such as FARE4 and SimCLIP4, adversarially fine-tune CLIP models for a few epochs on ImageNet and then plug the resulting encoder into the LLaVA framework without further training. For the robust vision encoders used in Robust-LLaVA4H and Robust-LLaVA4G, download the AdvXL model weights for the huge and giant models from here and update the paths in this file. Similarly, for the FARE4 and SimCLIP4 checkpoints, update the paths in this file.
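
Once the checkpoint paths are configured, a downloaded Stage 2 (instruction-tuned) model can be loaded through the standard LLaVA loader that this codebase builds on. The snippet below is a sketch: the local checkpoint directory name is illustrative, and the exact model naming may differ from what the release uses.

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

# Illustrative path to a downloaded Stage 2 checkpoint (adjust to your setup).
model_path = "./checkpoints/robust_llava_g_stage2"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,   # full checkpoint, so no separate base model is needed
    model_name=get_model_name_from_path(model_path),
)
model.eval()
```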


Quantitative Evaluation 📊

We provide detailed instructions for reproducing Robust-LLaVA results on both untargeted and targeted attacks across various image captioning and visual question answering benchmarks. Please refer to docs/EVAL.md for the step-by-step guide.
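
For orientation, the untargeted threat model is an ℓ∞-bounded perturbation of the input image (ε = 4/255 in the figure below). The sketch that follows is a generic PGD loop rather than the exact evaluation code in docs/EVAL.md, and it assumes a helper loss_fn(model, images, text) that returns the model's autoregressive loss on the ground-truth caption or answer.

```python
import torch

def pgd_untargeted(model, images, text, loss_fn,
                   eps=4/255, alpha=1/255, steps=100):
    """l_inf PGD that maximizes the caption/answer loss w.r.t. the image.

    `loss_fn(model, images, text)` is assumed to return the autoregressive
    loss of `text` (the ground-truth output) given `images` in [0, 1].
    """
    adv = images.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model, adv, text), adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                   # ascend the loss
            adv = images + (adv - images).clamp(-eps, eps)    # project to the eps-ball
            adv = adv.clamp(0, 1).detach()
    return adv
```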

Robust-LLaVA Diagram

On untargeted attacks across six datasets covering image captioning and visual question answering, both Robust-LLaVA4G and Robust-LLaVA4H maintain reasonable clean performance while achieving substantial robustness improvements over FARE4 and Sim-CLIP4, striking the right balance between clean and adversarial generalization.

Robust-LLaVA Diagram

Both FARE4 and Sim-CLIP4 show robustness against targeted attacks, but break in a few cases at high perturbation budgets (ε = 8/255). In contrast, Robust-LLaVA4G and Robust-LLaVA4H remain fully robust to these attacks even at high perturbation budgets. This indicates a strong resistance to generating the attacker's targeted output. The robustness of Robust-LLaVA4G stands out further as it continues to generate high-quality captions for adversarial examples, maintaining a strong CIDEr score.
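
The targeted attack evaluated above changes only the objective: instead of maximizing the loss on the ground-truth output, it minimizes the loss on an attacker-chosen target string, so the PGD loop from the previous sketch is reused with the gradient sign flipped. The target string below is purely illustrative.

```python
import torch

def pgd_targeted(model, images, loss_fn,
                 target_text="Visit https://example.com for more information.",  # illustrative
                 eps=8/255, alpha=1/255, steps=100):
    """l_inf PGD that pushes the model towards generating `target_text`."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model, adv, target_text), adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                   # descend: targeted objective
            adv = images + (adv - images).clamp(-eps, eps)
            adv = adv.clamp(0, 1).detach()
    return adv
```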

Robust-LLaVA Diagram

Comparison of various vision encoders integrated with LLaVA against white-box (VisualAdv) and black-box (HADES) jailbreak attacks. The white-box results (Table 3) show that LLaVA with the original CLIP encoder is the most vulnerable, producing the highest number of toxic outputs. In contrast, our Robust-LLaVA4G and Robust-LLaVA4H models significantly reduce toxic content generation. The black-box results (Table 4) highlight the effectiveness of different models against HADES attacks, with the original CLIP encoder exhibiting the highest Attack Success Rate (ASR). In contrast, our Robust-LLaVA models achieve the lowest ASR, demonstrating superior resilience across multiple adversarial scenarios.

Robust-LLaVA Diagram

Evaluation of vision encoder ensembles within the MLLM framework, assessing their robustness across multiple benchmarks. Our analysis reveals that an ensemble's robustness is limited by its weakest vision encoder: across all configurations, the most vulnerable component dictates overall robustness, highlighting the need for approaches that strengthen ensemble resilience.
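
One common way to ensemble vision encoders in a LLaVA-style MLLM is to concatenate their per-patch features along the channel dimension before the multimodal projector. The sketch below illustrates that pattern under simplifying assumptions (every encoder is frozen, returns features of shape (batch, num_patches, dim) with the same number of patches, and exposes an out_dim attribute); it is not necessarily this repo's exact ensembling implementation.

```python
import torch
import torch.nn as nn

class EnsembleVisionTower(nn.Module):
    """Concatenates per-patch features from several frozen vision encoders
    along the channel dimension, then projects them into the LLM space."""

    def __init__(self, encoders, llm_dim):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)           # frozen robust/CLIP towers
        total_dim = sum(enc.out_dim for enc in encoders)  # `out_dim` is an assumed attribute
        self.projector = nn.Sequential(                   # LLaVA-style two-layer MLP
            nn.Linear(total_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, images):
        with torch.no_grad():                             # encoders stay frozen
            feats = torch.cat([enc(images) for enc in self.encoders], dim=-1)
        return self.projector(feats)                      # (B, num_patches, llm_dim)
```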


Training 🚋

We will soon update the training code with detailed instructions for pretraining, fine-tuning, and ensembling various robust backbones (after completing the code cleanup). If you require early access, please reach out to us, and we can provide an unrefined version upon request, along with the necessary guidance for its use.


Qualitative Analysis 🔍

Untargeted Attack on Image Captioning Task

Targeted Attack on Image Captioning Task

Untargeted Attack on Visual Question Answering (VQA) Task

Robustness to Common Corruptions on Image Captioning Task


BibTeX 📜

@article{malik2025robust,
  title={Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models},
  author={Malik, Hashmat Shadab and Shamshad, Fahad and Naseer, Muzammal and Nandakumar, Karthik and Khan, Fahad and Khan, Salman},
  journal={arXiv preprint arXiv:2502.01576},
  year={2025}
}

Contact 📧

Should you have any questions, please create an issue on this repository or contact us at [email protected].


References 📖

  • Our codebase is built upon LLaVA and RobustVLM. We thank them for open-sourcing their codebases.
