Image Generators are Generalist Vision Learners
arXiv cs.CV / 4/23/2026
Key Points
- The paper argues that image and video generators develop zero-shot visual understanding abilities similar to emergent reasoning in LLMs trained via generative pretraining.
- It presents Vision Banana, a generalist vision model created by instruction-tuning Nano Banana Pro with a mix of original training data and a small amount of vision-task data.
- By parameterizing vision task outputs as RGB images, the authors reframe perception tasks as an image-generation problem, enabling a unified interface across tasks.
- Vision Banana achieves state-of-the-art or competitive performance on multiple 2D and 3D understanding tasks, outperforming or rivaling task specialists such as SAM 3 for segmentation and Depth Anything for depth estimation.
- The results suggest that lightweight instruction-tuning can preserve strong image-generation capability while yielding broadly useful visual representations, pointing to a potential paradigm shift toward vision foundation models built on generative pretraining.
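The "outputs as RGB images" idea can be made concrete with a toy encoder. The paper's exact parameterization is not detailed in this summary, so the 24-bit channel packing below (normalized depth split into high/mid/low bytes) is an assumption chosen purely for illustration, along with the helper names `depth_to_rgb` and `rgb_to_depth`:

```python
import numpy as np

def depth_to_rgb(depth, d_min, d_max):
    """Quantize a float depth map into a 24-bit RGB image.

    Illustrative sketch, not the paper's actual encoding: the
    normalized depth is packed into three 8-bit channels
    (R = high byte, G = mid byte, B = low byte).
    """
    norm = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)
    code = np.round(norm * (2**24 - 1)).astype(np.uint32)
    r = (code >> 16) & 0xFF
    g = (code >> 8) & 0xFF
    b = code & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb, d_min, d_max):
    """Invert depth_to_rgb: recover metric depth from the RGB code."""
    rgb = rgb.astype(np.uint32)
    code = (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
    norm = code / (2**24 - 1)
    return norm * (d_max - d_min) + d_min

# Round-trip check on a synthetic 4x4 depth map in [0.5, 10.0] m.
depth = np.linspace(0.5, 10.0, 16).reshape(4, 4)
img = depth_to_rgb(depth, 0.0, 10.0)
recon = rgb_to_depth(img, 0.0, 10.0)
```

With any such invertible encoding, a segmentation mask, depth map, or keypoint heatmap becomes just another image the generator can be asked to produce, which is what gives the single image-generation interface across tasks.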