Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

arXiv cs.CL / 4/6/2026


Key Points

  • The paper proposes a domain-adapted RAG pipeline for tutoring-move (pedagogical dialogue act) annotation that improves LLM performance without fine-tuning the generative model itself.
  • Instead of updating the LLM, the approach fine-tunes a lightweight embedding model on tutoring corpora and performs utterance-level indexing to retrieve labeled few-shot demonstrations.
  • Experiments on TalkMoves and Eedi, using multiple LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), show substantially higher agreement scores (Cohen’s κ) than no-retrieval baselines.
  • An ablation study indicates utterance-level indexing is the primary driver of gains, with top-1 label match rates rising from 39.7% to 62.0% on TalkMoves and from 52.9% to 73.1% on Eedi under domain-adapted retrieval.
  • Retrieval is also shown to reduce systematic label biases from zero-shot prompting and to yield the biggest improvements for rare, context-dependent labels, suggesting retrieval adaptation can be a practical route to higher-quality annotation.
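To make the retrieval step concrete, here is a minimal, self-contained sketch of utterance-level indexing and top-k demonstration retrieval. All names, labels, and the bag-of-words similarity are illustrative assumptions; the paper's actual pipeline uses a fine-tuned neural embedding model over real tutoring corpora.

```python
from collections import Counter
from math import sqrt

def embed(utterance: str) -> Counter:
    # Toy bag-of-words "embedding" for illustration only; the paper
    # fine-tunes a lightweight neural embedding model instead.
    return Counter(utterance.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Utterance-level index: each labeled utterance is its own entry,
# rather than indexing whole dialogues as single documents.
labeled_utterances = [
    ("Why do you think the slope is negative here?", "press_for_reasoning"),
    ("Good job, that's exactly right.", "praise"),
    ("Can anyone restate what Maria just said?", "revoicing"),
]
index = [(embed(u), u, label) for u, label in labeled_utterances]

def retrieve_demonstrations(query: str, k: int = 2):
    # Return the k most similar labeled utterances as few-shot demos.
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [(u, label) for _, u, label in ranked[:k]]

demos = retrieve_demonstrations("Why do you think that answer is right?")
prompt = "\n".join(f"Utterance: {u}\nMove: {label}" for u, label in demos)
```

The retrieved (utterance, label) pairs would then be prepended to the frozen LLM's prompt as in-context demonstrations.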

Abstract

Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's κ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines (κ = 0.275-0.413 and 0.160-0.410). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7% to 62.0% on TalkMoves and 52.9% to 73.1% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.
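The agreement scores above are Cohen's κ, which corrects raw agreement for the agreement expected by chance from the two annotators' label distributions: κ = (p_o − p_e) / (1 − p_e). A minimal computation from two label sequences (the labels below are illustrative, not from the paper's data):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected agreement under independent per-annotator label marginals.
    ca, cb = Counter(ann_a), Counter(ann_b)
    p_e = sum(ca[lbl] * cb[lbl] for lbl in set(ann_a) | set(ann_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

gold  = ["praise", "revoicing", "praise", "press", "praise"]
model = ["praise", "revoicing", "press", "press", "praise"]
kappa = cohens_kappa(gold, model)  # p_o = 0.8, p_e = 0.36, kappa = 0.6875
```

By convention, κ in the 0.4-0.6 range (the no-retrieval baselines' upper end) is often read as moderate agreement, while the best configurations' 0.66-0.74 on Eedi approaches substantial agreement.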