AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

arXiv cs.CV / March 17, 2026

Key Points

  • The paper conducts a systematic comparison of CNNs, contrastive Vision-Language Models, and generative Vision-Language Models for fine-grained crop disease classification.
  • It introduces AgriPath-LF16, a benchmark with 111k images across 16 crops and 41 diseases, including explicit lab vs field imagery separation and a standardized 30k training/evaluation subset.
  • Evaluations are performed under unified protocols across full, lab-only, and field-only training regimes, using macro-F1 and Parse Success Rate to measure both accuracy and generative reliability.
  • Results show that CNNs achieve the highest lab accuracy but degrade under domain shift; contrastive VLMs deliver robust cross-domain performance with fewer parameters; and generative VLMs are the most resilient to distributional variation, though they exhibit free-text generation failure modes.
  • The study argues deployment context should guide architectural choice rather than chasing aggregate accuracy alone.
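To make the two evaluation metrics concrete, here is a minimal sketch of how macro-F1 and Parse Success Rate (PSR) might be computed together for a generative model whose free-text outputs must first be mapped to a fixed label set. The label names, the `parse_prediction` normalization, and the treatment of unparsable outputs as misclassifications are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical label set; AgriPath-LF16 uses 41 disease classes.
LABELS = ["healthy", "leaf_rust", "blight"]

def parse_prediction(text, labels):
    """Map free-text model output to a known label, or None on parse failure.
    (Assumed normalization: lowercase, spaces to underscores.)"""
    cleaned = text.strip().lower().replace(" ", "_")
    return cleaned if cleaned in labels else None

def parse_success_rate(raw_outputs, labels):
    """PSR: fraction of generations that parse to a valid label."""
    parsed = [parse_prediction(o, labels) for o in raw_outputs]
    psr = sum(p is not None for p in parsed) / len(parsed)
    return psr, parsed

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; unparsable predictions (None)
    never match any class, so parse failures count as errors."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: one output fails to parse, lowering both PSR and macro-F1.
raw = ["leaf rust", "Blight", "a fungal infection", "healthy"]
true = ["leaf_rust", "blight", "leaf_rust", "healthy"]
psr, preds = parse_success_rate(raw, LABELS)
score = macro_f1(true, preds, LABELS)
```

Scoring parse failures as misclassifications is what lets a single macro-F1 number remain comparable across CNNs (which always emit a valid class) and generative VLMs (which may not), while PSR isolates the generation-reliability failure mode on its own axis.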

Abstract

Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.