AgriChain: Visually Grounded, Expert-Verified Reasoning for Interpretable Agricultural Vision-Language Models

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces AgriChain, a dataset of roughly 11,000 expert-curated agricultural leaf images spanning multiple crops and diseases, each labeled with a disease type, a calibrated confidence level, and an expert-verified chain-of-thought rationale.
  • Explanations were initially drafted by GPT-4o and then verified by a professional agricultural engineer using standardized visual descriptors such as lesion color, margin, and distribution to improve reliability and interpretability.
  • A specialized model, AgriChain-VL3B, is fine-tuned from Qwen2.5-VL-3B using this dataset to jointly predict diseases and produce visually grounded reasoning.
  • On a 1,000-image test set, the CoT-supervised model reaches 73.1% top-1 accuracy (macro F1 0.466; weighted F1 0.655), outperforming baselines including Gemini variants and GPT-4o Mini.
  • The work argues that expert-verified reasoning supervision improves both accuracy and alignment with human expert explanations, and the dataset and code are publicly released.
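
The gap between the reported macro F1 (0.466) and weighted F1 (0.655) suggests the test set is class-imbalanced: macro F1 averages per-class scores equally, while weighted F1 weights each class by its support. A minimal sketch of both averages (plain Python, no sklearn; the label names are illustrative, not taken from the dataset):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (macro F1, support-weighted F1) for two label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(classes)           # every class counts equally
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted

# Illustrative labels: the majority class ("rust") is predicted well,
# so the weighted average lands above the macro average.
macro, weighted = f1_scores(
    ["rust", "rust", "rust", "blight"],
    ["rust", "rust", "blight", "blight"],
)
```

When rare diseases are diagnosed poorly, macro F1 drops sharply while weighted F1 stays closer to overall accuracy, which matches the pattern in the paper's numbers.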

Abstract

Accurate and interpretable plant disease diagnosis remains a major challenge for vision-language models (VLMs) in real-world agriculture. We introduce AgriChain, a dataset of approximately 11,000 expert-curated leaf images spanning diverse crops and pathologies, each paired with (i) a disease label, (ii) a calibrated confidence score (High/Medium/Low), and (iii) an expert-verified chain-of-thought (CoT) rationale. Draft explanations were first generated by GPT-4o and then verified by a professional agricultural engineer using standardized descriptors (e.g., lesion color, margin, and distribution). We fine-tune Qwen2.5-VL-3B on AgriChain, resulting in a specialized model termed AgriChain-VL3B, to jointly predict diseases and generate visually grounded reasoning. On a 1,000-image test set, our CoT-supervised model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. The generated explanations align closely with expert reasoning, consistently referencing key visual cues. These findings demonstrate that expert-verified reasoning supervision significantly enhances both accuracy and interpretability, bridging the gap between generic multimodal models and human expertise, and advancing trustworthy, globally deployable AI for sustainable agriculture. The dataset and code are publicly available at: https://github.com/hazzanabeel12-netizen/agrichain
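
The abstract describes each image as paired with three annotations: a disease label, a High/Medium/Low confidence score, and an expert-verified CoT rationale. A hypothetical sketch of how such a record might be shaped into a supervised training example (field names, prompt text, and output template are assumptions, not the released dataset's actual schema):

```python
from dataclasses import dataclass

@dataclass
class LeafRecord:
    # Hypothetical schema mirroring the paper's three annotations;
    # the real dataset's field names may differ.
    image_path: str
    disease: str
    confidence: str  # "High" | "Medium" | "Low", per the paper
    rationale: str   # expert-verified chain-of-thought explanation

def to_chat_example(rec: LeafRecord) -> dict:
    """Build a chat-style supervision target that asks the model to
    jointly predict the disease and produce grounded reasoning.
    The template below is an assumed format, not the paper's."""
    return {
        "image": rec.image_path,
        "prompt": "Identify the disease on this leaf and explain the visual evidence.",
        "response": (
            f"Diagnosis: {rec.disease} (confidence: {rec.confidence}).\n"
            f"Reasoning: {rec.rationale}"
        ),
    }
```

Pairing the label with the rationale in one target is what lets a fine-tuned model such as AgriChain-VL3B learn to emit the diagnosis and the visual justification together rather than as separate heads.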