Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

arXiv cs.AI / 4/28/2026


Key Points

  • The study addresses how to evaluate conversational recommendations for sustainable city trips when human labeling is expensive and conventional metrics miss stakeholder-centric objectives.
  • It proposes an LLM-as-a-judge approach that scores recommendations across four dimensions—relevance, diversity, sustainability, and popularity balance—rather than relying on a single aggregate metric.
  • The authors introduce a three-phase calibration framework: baseline judging with multiple LLMs, expert evaluation to detect systematic misalignment, and dimension-specific calibration using rules and few-shot examples.
  • Experiments across two recommendation settings show that judges can agree on overall rankings while still exhibiting model-specific biases and high variance across dimensions, especially due to differing interpretations of “sustainability.”
  • The paper releases prompts and code for reproducibility, along with documentation in the linked GitHub repository.
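The dimension-wise judging and calibration described above can be sketched in a few lines. This is an illustrative outline only, not the authors' released code: the names (`DIMENSIONS`, `build_judge_prompt`, `aggregate`) and the 1–5 rubric format are assumptions, and the actual LLM call is omitted.

```python
# Hypothetical sketch of dimension-wise LLM-as-a-judge scoring with
# few-shot calibration. Names and prompt format are illustrative.

DIMENSIONS = ["relevance", "diversity", "sustainability", "popularity_balance"]

def build_judge_prompt(recommendation, dimension, rubric, few_shot=()):
    """Assemble a dimension-specific judging prompt (phase 3 of the
    calibration framework: explicit rules plus few-shot examples)."""
    parts = [
        f"Score the city-trip recommendation on {dimension} (1-5).",
        f"Rubric: {rubric}",
    ]
    for example, score in few_shot:
        parts.append(f"Example: {example}\nScore: {score}")
    parts.append(f"Recommendation: {recommendation}\nScore:")
    return "\n\n".join(parts)

def aggregate(scores_by_judge):
    """Average each dimension across multiple LLM judges while keeping
    per-dimension variance visible, rather than collapsing everything
    into a single aggregate metric."""
    agg = {}
    for dim in DIMENSIONS:
        vals = [s[dim] for s in scores_by_judge]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        agg[dim] = {"mean": mean, "variance": var}
    return agg
```

Keeping the per-dimension variance exposed is what lets the analysis surface cases where judges agree on overall rankings yet diverge sharply on a single dimension such as sustainability.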

Abstract

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions (relevance, diversity, sustainability, and popularity balance) and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.