Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop
arXiv cs.AI / 4/28/2026
Key Points
- The study addresses how to evaluate conversational recommendations for sustainable city trips when human labeling is expensive and conventional metrics miss stakeholder-centric objectives.
- It proposes an LLM-as-a-judge approach that scores recommendations across four dimensions (relevance, diversity, sustainability, and popularity balance) rather than relying on a single aggregate metric; a rough sketch of this rubric-style judging appears after this list.
- The authors introduce a three-phase calibration framework: baseline judging with multiple LLMs, expert evaluation to detect systematic misalignment, and dimension-specific calibration using rules and few-shot examples (the second sketch after this list illustrates that last phase).
- Experiments across two recommendation settings show that judges can agree on overall rankings while still exhibiting model-specific biases and high variance across dimensions, especially due to differing interpretations of “sustainability.”
- The paper releases prompts and code for reproducibility, along with documentation in the linked GitHub repository.
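The paper's actual prompts live in its repository; as a rough illustration of the multi-dimensional judging idea, the sketch below builds a rubric prompt over the four dimensions and parses per-dimension scores from a generic judge-model call. The `call_llm` stub, the 1–5 scale, and the JSON output format are assumptions for this sketch, not the authors' implementation.

```python
import json

# All names below (DIMENSIONS, build_judge_prompt, call_llm, judge) are illustrative,
# not taken from the paper's released code.
DIMENSIONS = ["relevance", "diversity", "sustainability", "popularity_balance"]

def build_judge_prompt(query: str, recommendation: str) -> str:
    """Assemble a rubric prompt that asks the judge to score each dimension separately."""
    rubric = "\n".join(f"- {d}: rate 1 (poor) to 5 (excellent)" for d in DIMENSIONS)
    return (
        "You are evaluating a sustainable city-trip recommendation.\n"
        f"User request: {query}\n"
        f"Recommendation: {recommendation}\n\n"
        "Score the recommendation on each dimension:\n"
        f"{rubric}\n\n"
        'Reply with JSON only, e.g. {"relevance": 4, "diversity": 3, ...}.'
    )

def call_llm(prompt: str) -> str:
    """Placeholder for the judge-model call; returns a canned reply so the sketch runs."""
    return '{"relevance": 4, "diversity": 3, "sustainability": 5, "popularity_balance": 2}'

def judge(query: str, recommendation: str) -> dict:
    """Return one score per dimension instead of a single aggregate number."""
    raw = call_llm(build_judge_prompt(query, recommendation))
    scores = json.loads(raw)
    return {d: int(scores[d]) for d in DIMENSIONS}

print(judge("Low-impact weekend trip from Berlin", "Two days in Leipzig by regional train"))
```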
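The dimension-specific calibration phase is described only at a high level in the summary; one plausible way to realize it is to prepend dimension-specific rules and a handful of expert-labeled few-shot examples to the judge prompt, as sketched below. The example rules and scores are invented for illustration and are not the paper's calibration data.

```python
# Hypothetical calibration data: dimension-specific rules plus expert-scored few-shot
# examples; in the paper these would come from the expert-evaluation phase.
CALIBRATION = {
    "sustainability": {
        "rules": [
            "Reward itineraries reachable by train or bus over short-haul flights.",
            "Do not treat 'popular' and 'sustainable' as interchangeable.",
        ],
        "few_shot": [
            {"recommendation": "Weekend in Ghent by train with walking tours", "score": 5},
            {"recommendation": "Day trip by private jet to a nature reserve", "score": 1},
        ],
    },
}

def calibrated_prompt(base_prompt: str, dimension: str) -> str:
    """Prepend dimension-specific rules and expert-labeled examples to the base rubric prompt."""
    cal = CALIBRATION.get(dimension)
    if cal is None:
        return base_prompt
    rules = "\n".join(f"- {r}" for r in cal["rules"])
    shots = "\n".join(
        f'Example: "{ex["recommendation"]}" -> {dimension} score {ex["score"]}'
        for ex in cal["few_shot"]
    )
    return f"Scoring rules for {dimension}:\n{rules}\n\n{shots}\n\n{base_prompt}"
```

This keeps the base rubric unchanged and only tightens the dimensions where expert review found systematic misalignment, which matches the paper's finding that "sustainability" was the dimension judges interpreted most inconsistently.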
Related Articles
LLMs will be a commodity
Reddit r/artificial

What it feels like to have Qwen 3.6 or Gemma 4 running locally
Reddit r/LocalLLaMA
AI Citation Registry: Why Daily Updates Leave No Time for Data Structuring
Dev.to
AI Voice Agents in Production: What Actually Works in 2026
Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.
Dev.to