CentaurTA Studio: A Self-Improving Human-Agent Collaboration System for Thematic Analysis

arXiv cs.AI / 4/22/2026


Key Points

  • The paper introduces CentaurTA Studio, a web-based system designed to scale thematic analysis through self-improving human–agent collaboration in open coding and theme construction.
  • It combines a two-stage human feedback workflow (simulator drafting plus expert validation), persistent prompt optimization that turns validated feedback into reusable alignment principles, and rubric-based evaluation with early stopping.
  • Across three domains, CentaurTA delivers up to 92.12% accuracy, outperforming baseline systems and achieving substantial agreement between an LLM-based judge and human annotators (average κ = 0.68).
  • Ablation studies show that removing key components (the feedback loop, the Critic, or early stopping) reduces accuracy or increases interaction cost, while the full system reaches its best results within about 10 iterative rounds (~25 minutes).
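The iterative refinement with early stopping described above can be sketched as a simple control loop. This is a hypothetical illustration, not the paper's implementation: the `score_fn` (rubric-based judge) and `revise_fn` (agent revision using validated feedback) stand in for CentaurTA's actual components, and the thresholds are invented for the example.

```python
# Hypothetical sketch of a rubric-scored refinement loop with early stopping.
# score_fn and revise_fn are placeholders for a rubric-based judge and an
# agent revision step; target, patience, and max_rounds are illustrative.
def refine(draft, score_fn, revise_fn, max_rounds=10, target=0.9, patience=2):
    best, best_score, stall = draft, score_fn(draft), 0
    for _ in range(max_rounds):
        if best_score >= target:
            break  # early stop: rubric threshold reached
        candidate = revise_fn(best)
        score = score_fn(candidate)
        if score > best_score:
            best, best_score, stall = candidate, score, 0
        else:
            stall += 1
            if stall >= patience:
                break  # early stop: feedback no longer improving the draft
    return best, best_score


# Toy demo: "quality" is just length/10, each revision appends a character.
final, final_score = refine(
    "abc",
    score_fn=lambda t: min(len(t) / 10, 1.0),
    revise_fn=lambda t: t + "x",
)
print(final_score)
```

Capping the number of rounds bounds interaction cost, which matches the paper's finding that dropping early stopping degrades accuracy or inflates cost.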

Abstract

Thematic analysis is difficult to scale: manual workflows are labor-intensive, while fully automated pipelines often lack controllability and transparent evaluation. We present **CentaurTA Studio**, a web-based system for self-improving human–agent collaboration in open coding and theme construction. The system integrates (1) a two-stage human feedback pipeline separating simulator drafting and expert validation, (2) persistent prompt optimization that distills validated feedback into reusable alignment principles, and (3) rubric-based evaluation with early stopping for process control. Across three domains, CentaurTA achieves the strongest performance in both Open Coding and Theme Construction, reaching up to 92.12% accuracy and consistently outperforming baseline systems. Agreement between the rubric-based LLM judge and human annotators reaches substantial reliability (average κ = 0.68). Ablation studies show that removing the feedback loop reduces performance from 90% to 81%, while eliminating the Critic or early stopping degrades accuracy or increases interaction cost. The full system reaches peak performance within 10 iterative rounds (about 25 minutes), demonstrating improved efficiency over expert-only refinement.
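The reported judge–human reliability uses Cohen's κ, which corrects raw agreement for agreement expected by chance. As a quick reference, here is a minimal self-contained implementation; the two label lists are invented toy data, not the paper's annotations.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: an LLM judge vs. a human annotator on six items.
judge = ["good", "good", "bad", "good", "bad", "good"]
human = ["good", "bad",  "bad", "good", "bad", "good"]
print(round(cohens_kappa(judge, human), 2))  # → 0.67
```

By the common Landis–Koch convention, values in the 0.61–0.80 range (such as the paper's average κ = 0.68) are read as "substantial" agreement.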