Evaluating Digital Inclusiveness of Digital Agri-Food Tools Using Large Language Models: A Comparative Analysis Between Human and AI-Based Evaluations

arXiv cs.CL / 4/7/2026


Key Points

  • The paper examines how to evaluate the digital inclusiveness of digital agri-food tools in the Global South, using the MDII framework as a baseline for expert-led assessment.
  • It benchmarks four LLMs (Grok, Gemini, GPT-4o, and GPT-5) to see whether AI-enabled evaluations can approximate human expert scores while completing far faster than the current MDII process.
  • Results indicate that LLMs can produce evaluative outputs that resemble expert judgment in some dimensions, but accuracy and reliability vary by model and evaluation context.
  • The study analyzes factors affecting performance, including temperature sensitivity and potential bias sources, highlighting the need for caution when using GenAI for inclusion monitoring.
  • Overall, it offers exploratory evidence for integrating GenAI into digital development monitoring of agritools in time-sensitive or resource-constrained settings, while treating it as a complement to, rather than a replacement for, expert evaluation.

Abstract

Ensuring digital inclusiveness is a critical priority in agri-food systems, particularly in the Global South, where digital divides persist. The Multidimensional Digital Inclusiveness Index (MDII) offers a comprehensive, human-led framework to assess how inclusive digital agricultural tools (agritools) are. However, the current evaluation process is resource intensive, often requiring months to complete. This study explores whether large language models (LLMs) can support a rapid, AI-enabled assessment of digital inclusiveness, complementing the MDII's existing workflow. Using a comparative analysis, the research benchmarks the performance of four LLMs (Grok, Gemini, GPT-4o, and GPT-5) against prior expert-led evaluations. The study investigates model alignment with human scores, sensitivity to temperature settings, and potential sources of bias. Findings suggest that LLMs can generate evaluative outputs that approximate expert judgment in some dimensions, though reliability varies across models and contexts. This exploratory work provides early evidence for the integration of GenAI into inclusive digital development monitoring, with implications for scaling evaluations in time-sensitive or resource-constrained environments.
