Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models
Abstract
The Multi-Faceted Attack (MFA) framework reveals vulnerabilities in VLMs by transferring adversarial attacks across models, achieving high success rates even against state-of-the-art defenses.
The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack
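For readers who want to probe the cross-encoder transfer claim, below is a minimal sketch, not the authors' released implementation (see the repository above for that). It assumes a CLIP ViT-L/14 surrogate encoder, a PGD-style L_inf attack, and a cosine-similarity objective toward an attacker-chosen text embedding; the model name, step sizes, budget, and the `benign.jpg` placeholder are all illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch, not the authors' released code: a PGD-style attack on a
# single surrogate vision encoder (CLIP ViT-L/14 here), illustrating why an
# image optimized against one shared encoder may transfer to unseen VLMs that
# build on similar visual representations.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
for p in model.parameters():                # the surrogate encoder stays frozen
    p.requires_grad_(False)

# Target direction in the shared embedding space. A text embedding is used
# purely for illustration; the paper's actual optimization objective may differ.
text_inputs = processor(text=["<attacker-chosen target concept>"],
                        return_tensors="pt", padding=True).to(device)
target = model.get_text_features(**text_inputs)
target = target / target.norm(dim=-1, keepdim=True)

# Benign carrier image ("benign.jpg" is a placeholder path), preprocessed to
# CLIP's normalized pixel_values. A real attack would instead constrain the
# perturbation in raw [0, 1] pixel space before normalization.
benign = processor(images=Image.open("benign.jpg"),
                   return_tensors="pt")["pixel_values"].to(device)
delta = torch.zeros_like(benign, requires_grad=True)
eps, alpha, steps = 8 / 255, 1 / 255, 200   # illustrative L_inf budget and schedule

for _ in range(steps):
    emb = model.get_image_features(pixel_values=benign + delta)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    loss = -(emb * target).sum()            # maximize cosine similarity to the target
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()  # signed gradient step
        delta.clamp_(-eps, eps)             # project back into the L_inf ball
    delta.grad.zero_()

adv_pixel_values = (benign + delta).detach()  # candidate image to test on unseen VLMs
```

An image produced this way would then be paired with the meta-task prompt (optionally repeated, per the paper's repetition strategy for slipping past output-level filters) and submitted to VLMs the attack was never tuned on, to measure how far the perturbation transfers.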