A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images

arXiv cs.CV / 3/25/2026


Key Points

  • The paper introduces a vision-language model approach to generate 3D procedural plant architecture representations from simulated (synthetic) image inputs.
  • Instead of relying on 3D sensors or multi-view computer vision, the method encodes plant architecture as token sequences that a language model predicts, enabling organ-level geometric and topological parameter recovery.
  • Training and evaluation use a synthetic cowpea dataset generated with the Helios 3D plant simulator, where exact architectural parameters are available via XML ground truth.
  • The model shows strong sequence-prediction performance (token F1 of 0.73 with teacher forcing) and, under autoregressive generation, achieves a BLEU-4 of 94.00% and a ROUGE-L of 0.5182.
  • The authors conclude that organ-level architectural parameter extraction from images is feasible using a VLM and plan to extend the workflow to real-world imagery in future work.
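The central idea in the points above is representing a plant's nested organ structure as a flat token sequence a language model can predict. A minimal sketch of such a tokenizer is shown below, assuming a Helios-style XML layout with nested organ elements and numeric attributes; the tag names, attributes, and token format here are illustrative assumptions, not the paper's actual encoding.

```python
# Hypothetical plant-architecture tokenizer sketch. The XML schema
# (tags like <shoot>/<leaf> and their attributes) is invented for
# illustration and is NOT the paper's actual Helios format.
import xml.etree.ElementTree as ET

def tokenize_architecture(xml_text: str) -> list[str]:
    """Flatten a plant-architecture XML tree into a token sequence.

    Each element contributes an opening token, one token pair per
    attribute (name and value kept separate so a language model can
    predict them independently), and a closing token that preserves
    the organ nesting (topology).
    """
    def walk(elem):
        tokens = [f"<{elem.tag}>"]
        for name, value in sorted(elem.attrib.items()):
            tokens.append(f"{name}=")
            tokens.append(value)
        for child in elem:
            tokens.extend(walk(child))
        tokens.append(f"</{elem.tag}>")
        return tokens

    return walk(ET.fromstring(xml_text))

# Toy example: a shoot bearing one leaf, with made-up parameters.
xml = '<shoot length="12.5"><leaf angle="45" area="3.1"/></shoot>'
print(tokenize_architecture(xml))
# ['<shoot>', 'length=', '12.5', '<leaf>', 'angle=', '45',
#  'area=', '3.1', '</leaf>', '</shoot>']
```

Because closing tokens mirror the XML nesting, a generated sequence can be parsed back into a tree, which is what lets the model recover both geometric parameters (attribute values) and topology (organ nesting).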

Abstract

Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, for extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters and nested structures for these models at field scale remains prohibitively labor-intensive. We present a novel algorithm that generates a 3D plant architecture from an image, creating a functional-structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant's architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we propose a method that generates token sequences encoding a procedural definition of plant architecture. This work used only synthetic images for training and testing, with exact architectural parameters known, allowing us to test the hypothesis that organ-level architectural parameters can be extracted from image data using a vision-language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer for the XML file defining plant architecture, converting it into a token sequence that a language model can predict. The model achieved a token F1 score of 0.73 during teacher-forced training. Evaluation of the model was performed through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. We conclude that such plant architecture model generation and parameter extraction are possible from synthetic images; future work will extend the approach to real imagery.
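The abstract reports a token-level F1 score under teacher forcing. A minimal sketch of that metric, treating the predicted and reference sequences as token multisets (SQuAD-style token F1), is shown below; this illustrates the metric in general, not the paper's exact evaluation code.

```python
# Illustrative token-level F1 over two token sequences, computed as
# the harmonic mean of multiset precision and recall. This is a
# common definition of "token F1", assumed here; the paper may bin
# or weight tokens differently.
from collections import Counter

def token_f1(predicted: list[str], reference: list[str]) -> float:
    # Multiset intersection counts each token at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy example: one mispredicted numeric value out of four tokens.
ref = ["<leaf>", "area=", "3.1", "</leaf>"]
pred = ["<leaf>", "area=", "2.9", "</leaf>"]
print(round(token_f1(pred, ref), 2))  # 0.75
```

Note that this metric rewards getting the structural tokens (tags and attribute names) right even when a numeric value is off, which is one plausible reason teacher-forced F1 and autoregressive BLEU-4 can both be reported alongside a lower ROUGE-L.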