
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, and Salman Khan

Mohamed bin Zayed University of AI (MBZUAI)

Paper | Website | Model Weights


🔥 Updates

  • (Feb 03, 2025)
    • Adversarial evaluation codes are released.
    • Robust-LLaVA-H and Robust-LLaVA-G released: Excited to release the new integration of LLaVA with large-scale robust image encoders, ViT-H and ViT-G, respectively. 🔥🔥

Robust-LLaVA Diagram

Robustness score of Robust-LLaVA4 on downstream vision-language tasks with adversarial examples crafted at ε = 4/255: the original CLIP encoder integrated into LLaVA exhibits minimal robustness. Our proposed Robust-LLaVA4 outperforms state-of-the-art robust CLIP models such as FARE4 and Sim-CLIP4 in robustness score across all tasks and diverse datasets, while maintaining high clean accuracy. (Accuracy is reported for VQAv2 and TextVQA, while the CIDEr score is reported for Flickr30k and COCO.)

Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language tasks, but their reliance on visual processing introduces critical security vulnerabilities. Their vision encoders remain susceptible to adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms while maintaining coherent language generation. Current approaches attempt to address this by adversarially fine-tuning CLIP vision encoders on ImageNet-scale data, but exhibit inherent limitations in both robustness and generalization due to the restricted scale and diversity of adversarial training. In this work, we present an alternative approach by leveraging vision encoders adversarially pre-trained on billion-scale image-text pairs. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these encoders to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM optimization with these robust encoders facilitates enhanced adaptation of language components to robust visual features, substantially outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust encoders achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2× and 1.5× average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against advanced jailbreaking attacks compared to state-of-the-art methods.


Robust-LLaVA Diagram

Current multi-modal large language models (MLLMs) struggle to achieve high adversarial robustness while maintaining strong vision-language reasoning. Methods such as TeCoA, FARE, and SimCLIP perform constrained adversarial fine-tuning of CLIP to preserve the generalization capabilities of the pre-trained model. However, this limited adversarial training results in only modest robustness gains when the model is integrated into an MLLM framework. Moreover, the misalignment between adversarial CLIP training objectives and MLLMs' generative understanding creates a semantic alignment gap, impairing MLLMs' ability to perform complex visual reasoning. This leads us to explore whether current large-scale adversarially pre-trained vision encoders, which contain rich robust representations, can exhibit strong semantic alignment within the MLLM framework.

Left: We investigate the multimodal alignment of robust encoders by mapping their feature space, through a single linear layer, onto that of the pre-trained CLIP model, which has strong multimodal feature representations. The aligned robust encoders are then paired with CLIP's text encoder and evaluated on robust zero-shot classification, which serves as a measure of their robust multimodal alignment.

Right: The results demonstrate a strong correlation between model scale, training strategy, and robustness preservation during CLIP alignment. Small-scale models (e.g., ViT-B and ResNet-101) suffer significant robustness degradation after alignment, with accuracy dropping below 60% across all datasets. In contrast, large-scale models (ViT-H and ViT-G) retain their robustness while acquiring robust zero-shot capabilities. Leveraging this insight, we integrate these robust encoders into the LLaVA framework, achieving strong adversarial robustness and semantic alignment in MLLMs without additional specialized adversarial training.
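The alignment probe described above can be sketched in a few lines. The following is a minimal illustration rather than the actual training script: it assumes an OpenAI ViT-L/14 CLIP model loaded via open_clip as the alignment target, a generic robust_encoder callable that maps images to feature vectors (e.g., an AdvXL backbone), and illustrative feature dimensions. Only the linear layer is trained; both encoders stay frozen, and zero-shot evaluation pairs the projected robust features with CLIP's text embeddings.

```python
import torch
import torch.nn as nn
import open_clip

# Frozen CLIP model used as the alignment target and for zero-shot evaluation.
# (Assumption: OpenAI ViT-L/14 via open_clip; the exact target may differ.)
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
clip_model.eval().requires_grad_(False)

feat_dim, clip_dim = 1280, 768                 # illustrative dimensions
align = nn.Linear(feat_dim, clip_dim)          # the only trainable component
optimizer = torch.optim.AdamW(align.parameters(), lr=1e-4)

def alignment_step(robust_encoder, images):
    """Pull linearly projected robust features towards frozen CLIP image features."""
    with torch.no_grad():
        target = clip_model.encode_image(images)   # CLIP feature space
        source = robust_encoder(images)            # robust encoder feature space
    loss = nn.functional.mse_loss(align(source), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def zero_shot_logits(robust_encoder, images, class_prompts):
    """Zero-shot scores: projected robust image features vs. CLIP text features."""
    img = nn.functional.normalize(align(robust_encoder(images)), dim=-1)
    txt = nn.functional.normalize(
        clip_model.encode_text(tokenizer(class_prompts)), dim=-1)
    return img @ txt.T
```

Robust zero-shot accuracy is then obtained by running zero_shot_logits on adversarially perturbed images and checking whether the correct class prompt still scores highest.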


Installation 💿

You can follow the installation instructions in the LLaVA codebase, or follow the steps below:

  1. Clone the repository:
git clone https://github.com/HashmatShadab/Robust-LLaVA 
cd Robust-LLaVA
  2. Install the required dependencies:
conda create -n robust_llava python=3.10 -y
conda activate robust_llava
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia 
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

pip install open-clip-torch==2.19.0
pip install pycocoevalcap==1.2
pip install inflection==0.5.1
pip install torchattacks
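
After installation, an optional sanity check confirms that the core dependencies import cleanly and that a CUDA device is visible:

```python
# Optional sanity check for the environment created above.
import torch
import open_clip      # noqa: F401
import torchattacks   # noqa: F401

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```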

Model Zoo 🧠

| Model | Stage 1: Feature Alignment | Stage 2: Instruction Tuning |
|---|---|---|
| CLIP | Link | Link |
| Robust-LLaVA4H | Link | Link |
| Robust-LLaVA4G | Link | Link |
| Robust-LLaVA4H + CLIP | Link | Link |
| Robust-LLaVA4G + CLIP | Link | Link |
| Robust-LLaVA4H + Robust-LLaVA4G | Link | Link |
| ViT-B/16 (Adversarially trained on ImageNet-1k) | Link | Link |
| ViT-L/14 (Naturally trained on ImageNet-21k + ImageNet-1k) | Link | Link |

🔗 All checkpoints for Stage 1 (Feature Alignment) and Stage 2 (Instruction Tuning) are available at:

➡️ Stage 1 Checkpoints
➡️ Stage 2 Checkpoints

Previous works, such as FARE4 and SimCLIP4, adversarially fine-tune CLIP models for a few epochs on ImageNet and then plug the resulting encoder into the LLaVA framework without further training. For the robust vision encoders used in Robust-LLaVA4H and Robust-LLaVA4G, download the AdvXL model weights for the huge and giant models from here and update the paths in this file. Similarly, for the FARE4 and SimCLIP4 checkpoints, update the paths in this file.
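
Once the checkpoint paths are configured, a downloaded Stage 2 (instruction-tuned) model can be loaded through the standard LLaVA loader that this codebase builds on. The snippet below is a sketch: the local checkpoint directory name is illustrative, and the exact model naming may differ from what the release uses.

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

# Illustrative path to a downloaded Stage 2 checkpoint (adjust to your setup).
model_path = "./checkpoints/robust_llava_g_stage2"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,   # full checkpoint, so no separate base model is needed
    model_name=get_model_name_from_path(model_path),
)
model.eval()
```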


Quantitative Evaluation 📊

We provide detailed instructions for reproducing Robust-LLaVA results on both untargeted and targeted attacks across various image captioning and visual question answering benchmarks. Please refer to docs/EVAL.md for the step-by-step guide.
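
For orientation, the untargeted threat model is an ℓ∞-bounded perturbation of the input image (ε = 4/255 in the figure below). The sketch that follows is a generic PGD loop rather than the exact evaluation code in docs/EVAL.md, and it assumes a helper loss_fn(model, images, text) that returns the model's autoregressive loss on the ground-truth caption or answer.

```python
import torch

def pgd_untargeted(model, images, text, loss_fn,
                   eps=4/255, alpha=1/255, steps=100):
    """l_inf PGD that maximizes the caption/answer loss w.r.t. the image.

    `loss_fn(model, images, text)` is assumed to return the autoregressive
    loss of `text` (the ground-truth output) given `images` in [0, 1].
    """
    adv = images.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model, adv, text), adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                   # ascend the loss
            adv = images + (adv - images).clamp(-eps, eps)    # project to the eps-ball
            adv = adv.clamp(0, 1).detach()
    return adv
```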

Robust-LLaVA Diagram

On untargeted attacks across six datasets covering image captioning and visual question answering, both Robust-LLaVA4G and Robust-LLaVA4H maintain reasonable clean performance while achieving substantial robustness improvements over FARE4 and Sim-CLIP4, striking the right balance between clean and adversarial generalization.

Robust-LLaVA Diagram

Both FARE4 and Sim-CLIP4 show robustness against targeted attacks, but break in a few cases at high perturbation budgets (ε = 8/255). In contrast, Robust-LLaVA4G and Robust-LLaVA4H remain fully robust to these attacks even at high perturbation budgets. This indicates a strong resistance to generating the attacker's targeted output. The robustness of Robust-LLaVA4G stands out further as it continues to generate high-quality captions for adversarial examples, maintaining a strong CIDEr score.
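
The targeted attack evaluated above changes only the objective: instead of maximizing the loss on the ground-truth output, it minimizes the loss on an attacker-chosen target string, so the PGD loop from the previous sketch is reused with the gradient sign flipped. The target string below is purely illustrative.

```python
import torch

def pgd_targeted(model, images, loss_fn,
                 target_text="Visit https://example.com for more information.",  # illustrative
                 eps=8/255, alpha=1/255, steps=100):
    """l_inf PGD that pushes the model towards generating `target_text`."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model, adv, target_text), adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                   # descend: targeted objective
            adv = images + (adv - images).clamp(-eps, eps)
            adv = adv.clamp(0, 1).detach()
    return adv
```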

Robust-LLaVA Diagram

Comparison of various vision encoders integrated with LLaVA against white-box (VisualAdv) and black-box (HADES) jailbreak attacks. The white-box results (Table 3) show that LLaVA with the original CLIP encoder is the most vulnerable, producing the highest number of toxic outputs. In contrast, our Robust-LLaVA4G and Robust-LLaVA4H models significantly reduce toxic content generation. The black-box results (Table 4) highlight the effectiveness of different models against HADES attacks, with the original CLIP encoder exhibiting the highest Attack Success Rate (ASR). In contrast, our Robust-LLaVA models achieve the lowest ASR, demonstrating superior resilience across multiple adversarial scenarios.

Robust-LLaVA Diagram

Evaluation of vision encoder ensembles within the MLLM framework, assessing their robustness across multiple benchmarks. Our analysis reveals that an ensemble's robustness is limited by its weakest vision encoder: across all configurations, the most vulnerable component dictates overall robustness, highlighting the need for approaches that strengthen ensemble resilience.
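
One common way to ensemble vision encoders in a LLaVA-style MLLM is to concatenate their per-patch features along the channel dimension before the multimodal projector. The sketch below illustrates that pattern under simplifying assumptions (every encoder is frozen, returns features of shape (batch, num_patches, dim) with the same number of patches, and exposes an out_dim attribute); it is not necessarily this repo's exact ensembling implementation.

```python
import torch
import torch.nn as nn

class EnsembleVisionTower(nn.Module):
    """Concatenates per-patch features from several frozen vision encoders
    along the channel dimension, then projects them into the LLM space."""

    def __init__(self, encoders, llm_dim):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)           # frozen robust/CLIP towers
        total_dim = sum(enc.out_dim for enc in encoders)  # `out_dim` is an assumed attribute
        self.projector = nn.Sequential(                   # LLaVA-style two-layer MLP
            nn.Linear(total_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, images):
        with torch.no_grad():                             # encoders stay frozen
            feats = torch.cat([enc(images) for enc in self.encoders], dim=-1)
        return self.projector(feats)                      # (B, num_patches, llm_dim)
```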


Training 🚋

We will soon update the training code with detailed instructions for pretraining, fine-tuning, and ensembling various robust backbones (after completing the code cleanup). If you require early access, please reach out to us, and we can provide an unrefined version upon request, along with the necessary guidance for its use.


Qualitative Analysis 🔍

Untargeted Attack on Image Captioning Task

Targeted Attack on Image Captioning Task

Untargeted Attack on Visual Question Answering (VQA) Task

Robustness to Common Corruptions on Image Captioning Task


BibTeX 📜

@article{malik2025robust,
  title={Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models},
  author={Malik, Hashmat Shadab and Shamshad, Fahad and Naseer, Muzammal and Nandakumar, Karthik and Khan, Fahad and Khan, Salman},
  journal={arXiv preprint arXiv:2502.01576},
  year={2025}
}

Contact 📧

Should you have any questions, please create an issue on this repository or contact us at [email protected].


References 📖

  • Our codebase is built upon LLaVA and RobustVLM. We thank them for open-sourcing their codebases.
