Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience
arXiv cs.CV / 3/19/2026
Key Points
- The paper investigates whether general-purpose multimodal large language models (MLLMs) can perform iris presentation attack detection (PAD) under strict privacy constraints, using human expert knowledge to augment prompts.
- Pre-trained vision transformers in MLLMs inherently cluster iris attack types in their embeddings, even without explicit training for PAD.
- When structured prompts incorporating human salience (verbal indicators from subjects) are used, the models resolve ambiguities and improve detection.
- On an IRB-restricted dataset of 224 iris images spanning seven attack types (evaluated only through university-approved services or locally hosted models), Gemini with expert-informed prompts outperforms both a CNN-based baseline and human examiners, while Llama 3.2-Vision achieves near-human performance.
- The results suggest MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD, addressing data-sharing and privacy challenges while maintaining high accuracy.
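The expert-informed prompting described above can be illustrated with a minimal sketch. The function name, attack-type labels, and salience cues below are hypothetical placeholders for the sake of the example; the paper's actual prompt templates and indicator vocabulary are not reproduced here.

```python
# Hypothetical sketch: assembling a structured iris-PAD prompt that folds in
# human-salience cues before it is sent to a multimodal LLM alongside an image.
# Class names and cues are illustrative, not the paper's actual wording.

ATTACK_TYPES = [
    "live iris", "printed iris", "textured contact lens",
    "synthetic iris", "post-mortem iris", "display replay", "artificial eye",
]

def build_pad_prompt(salience_cues: list[str]) -> str:
    """Combine human-salience cues into one structured classification prompt."""
    cue_lines = "\n".join(f"- {cue}" for cue in salience_cues)
    return (
        "You are assisting with iris presentation attack detection.\n"
        "Candidate classes: " + ", ".join(ATTACK_TYPES) + ".\n"
        "Human examiners flagged these salient features:\n"
        f"{cue_lines}\n"
        "Classify the attached iris image and briefly justify your answer."
    )

prompt = build_pad_prompt([
    "unnatural dot pattern overlaid on the iris texture",
    "specular highlights inconsistent with a live cornea",
])
print(prompt)
```

In this sketch the cues play the role of the "human salience" signal: they narrow the model's attention to examiner-flagged regions, which is how the structured prompts are said to resolve ambiguities that the embeddings alone leave open.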