TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

arXiv cs.CV / 3/30/2026


Key Points

  • The paper introduces TaxaAdapter, a lightweight method for fine-grained text-to-image generation across biological “Tree of Life” species by injecting Vision Taxonomy Model (VTM) embeddings (e.g., BioCLIP) into a frozen diffusion text-to-image model.
  • TaxaAdapter is reported to improve species-level morphology fidelity and species-identity accuracy versus strong baselines while maintaining flexible text control over attributes like pose, style, and background.
  • The authors propose a multimodal LLM-based evaluation metric that compares trait-level descriptions of generated and real images, yielding a more interpretable measure of morphological consistency.
  • The experiments report strong generalization, including few-shot synthesis of species with only a handful of training images and generation of species not seen during training.
  • Overall, the work argues that VTMs are an essential component for scalable, fine-grained species generation at large biodiversity scale (10M+ species).
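The injection mechanism in the first point can be sketched at a shape level. This is a hypothetical illustration, not the paper's implementation: the function names, projection layout, and token counts below are assumptions. The idea shown is decoupled conditioning, where a single VTM (e.g. BioCLIP-style) image embedding is linearly projected into a few pseudo-tokens and appended to the text-token sequence that the frozen diffusion model cross-attends to.

```python
# Hypothetical sketch of TaxaAdapter-style conditioning (names and shapes
# are assumptions): project a VTM embedding into the text-token space and
# append it as extra conditioning tokens for a frozen diffusion model.

import random

def linear(weights, vec):
    """Plain matrix-vector product: weights is [out][in]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def inject_vtm_tokens(text_tokens, vtm_embedding, proj_weights, n_extra=4):
    """Project the VTM embedding into n_extra pseudo-tokens and append
    them to the text conditioning sequence (decoupled conditioning)."""
    projected = linear(proj_weights, vtm_embedding)  # dim = n_extra * d_text
    d_text = len(text_tokens[0])
    extra = [projected[i * d_text:(i + 1) * d_text] for i in range(n_extra)]
    return text_tokens + extra  # frozen UNet cross-attends to all tokens

# Toy usage: 77 text tokens of dim 8, a 16-dim VTM embedding, 4 extra tokens.
random.seed(0)
d_text, d_vtm, n_extra = 8, 16, 4
text_tokens = [[random.random() for _ in range(d_text)] for _ in range(77)]
vtm_embedding = [random.random() for _ in range(d_vtm)]
proj = [[random.random() for _ in range(d_vtm)] for _ in range(n_extra * d_text)]

conditioned = inject_vtm_tokens(text_tokens, vtm_embedding, proj, n_extra)
print(len(conditioned))      # 77 text tokens + 4 VTM tokens = 81
print(len(conditioned[-1]))  # each token has dim 8
```

Only the small projection would be trained in such a scheme; the text encoder and diffusion backbone stay frozen, which is why the text prompt retains control over pose, style, and background.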

Abstract

Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
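The trait-level metric described above can be sketched as follows. The prompt format, trait syntax, and scoring rule here are assumptions for illustration, not the paper's definition: an MLLM describes attribute traits of a real and a generated image, and morphological consistency is scored as the overlap (here, Jaccard similarity) between the two trait sets.

```python
# Hypothetical sketch of an MLLM-based morphology metric (trait format
# and scoring rule are assumptions): compare trait descriptions of a
# real vs. generated image via Jaccard overlap of (attribute, value) pairs.

def parse_traits(description):
    """Split a trait description like 'wing: barred; beak: short'
    into a set of normalized (attribute, value) pairs."""
    traits = set()
    for part in description.split(";"):
        if ":" in part:
            attr, val = part.split(":", 1)
            traits.add((attr.strip().lower(), val.strip().lower()))
    return traits

def morphology_consistency(real_desc, gen_desc):
    """Jaccard overlap of trait sets; 1.0 means all traits agree."""
    real, gen = parse_traits(real_desc), parse_traits(gen_desc)
    if not real and not gen:
        return 1.0
    return len(real & gen) / len(real | gen)

# Toy usage: the generated bird matches wing and crown but not beak shape.
real = "wing: barred; beak: short and conical; crown: red"
gen = "wing: barred; beak: long and thin; crown: red"
print(morphology_consistency(real, gen))  # 2 shared of 4 distinct -> 0.5
```

A set-overlap score like this is interpretable in exactly the sense the abstract emphasizes: each missing point of consistency can be traced back to a named trait rather than an opaque embedding distance.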