AI Navigate

End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

arXiv cs.CL / 3/12/2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • The paper presents an end-to-end automatic evaluator for domain-specific chatbots that reduces manual review by automatically generating Q&A pairs from the underlying knowledge base and using LLMs to judge chatbot responses against reference answers.
  • It introduces confidence-based filtering to highlight uncertain cases, helping reviewers focus on the most ambiguous outputs.
  • The method is demonstrated on a Vietnamese news dataset, where it achieves high agreement with human judgments while significantly lowering review overhead.
  • The framework is modular and language-agnostic, enabling easy adaptation to diverse domains and deployment scenarios with minimal manual intervention.
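The pipeline above can be sketched in a few lines: judge each chatbot response against its reference answer, attach a confidence score, and route low-confidence verdicts to human review. This is a minimal illustrative sketch, not the paper's implementation; the judge here is a trivial string-matching stub standing in for an LLM judge, and the names (`judge_response`, `CONFIDENCE_THRESHOLD`) and toy scores are assumptions.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # verdicts below this go to human review (assumed value)

@dataclass
class EvalCase:
    question: str
    reference: str   # answer generated from the knowledge base
    response: str    # the chatbot's answer under evaluation

def judge_response(case: EvalCase) -> tuple[str, float]:
    """Stub judge returning (verdict, confidence).

    In the paper's setting this would be an LLM prompted with the
    question, the reference answer, and the chatbot response; here we
    use exact string matching purely for illustration."""
    match = case.response.strip().lower() == case.reference.strip().lower()
    verdict = "correct" if match else "incorrect"
    confidence = 0.95 if match else 0.60  # toy scores, not real calibration
    return verdict, confidence

def evaluate(cases: list[EvalCase]):
    """Split cases into auto-accepted verdicts and those flagged for review."""
    auto_accepted, needs_review = [], []
    for case in cases:
        verdict, conf = judge_response(case)
        bucket = auto_accepted if conf >= CONFIDENCE_THRESHOLD else needs_review
        bucket.append((case.question, verdict, conf))
    return auto_accepted, needs_review

cases = [
    EvalCase("What year was Hanoi founded?", "1010", "1010"),
    EvalCase("Who reported the story?", "Nguyen Van A", "An unnamed reporter"),
]
accepted, review = evaluate(cases)
print(f"auto-accepted: {len(accepted)}, flagged for review: {len(review)}")
```

The key design point is that only the second list ever reaches a human, which is how the framework cuts review overhead while keeping ambiguous verdicts under manual oversight.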

Abstract

Large language models (LLMs) combined with retrieval-augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.