Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

arXiv cs.AI / 4/28/2026


Key Points

  • The study addresses how to evaluate conversational recommendations for sustainable city trips when human labeling is expensive and conventional metrics miss stakeholder-centric objectives.
  • It proposes an LLM-as-a-judge approach that scores recommendations across four dimensions—relevance, diversity, sustainability, and popularity balance—rather than relying on a single aggregate metric.
  • The authors introduce a three-phase calibration framework: baseline judging with multiple LLMs, expert evaluation to detect systematic misalignment, and dimension-specific calibration using rules and few-shot examples.
  • Experiments across two recommendation settings show that judges can agree on overall rankings while still exhibiting model-specific biases and high variance across dimensions, especially due to differing interpretations of “sustainability.”
  • The paper releases prompts and code for reproducibility, along with documentation in the linked GitHub repository.
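The dimension-wise judging and calibration described above can be sketched in a few lines. This is an illustrative outline only, not the authors' released code: the names (`DIMENSIONS`, `build_judge_prompt`, `aggregate`) and the 1–5 rubric format are assumptions, and the actual LLM call is omitted.

```python
# Hypothetical sketch of dimension-wise LLM-as-a-judge scoring with
# few-shot calibration. Names and prompt format are illustrative.

DIMENSIONS = ["relevance", "diversity", "sustainability", "popularity_balance"]

def build_judge_prompt(recommendation, dimension, rubric, few_shot=()):
    """Assemble a dimension-specific judging prompt (phase 3 of the
    calibration framework: explicit rules plus few-shot examples)."""
    parts = [
        f"Score the city-trip recommendation on {dimension} (1-5).",
        f"Rubric: {rubric}",
    ]
    for example, score in few_shot:
        parts.append(f"Example: {example}\nScore: {score}")
    parts.append(f"Recommendation: {recommendation}\nScore:")
    return "\n\n".join(parts)

def aggregate(scores_by_judge):
    """Average each dimension across multiple LLM judges while keeping
    per-dimension variance visible, rather than collapsing everything
    into a single aggregate metric."""
    agg = {}
    for dim in DIMENSIONS:
        vals = [s[dim] for s in scores_by_judge]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        agg[dim] = {"mean": mean, "variance": var}
    return agg
```

Keeping the per-dimension variance exposed is what lets the analysis surface cases where judges agree on overall rankings yet diverge sharply on a single dimension such as sustainability.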

Abstract

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions (relevance, diversity, sustainability, and popularity balance) and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.