PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

arXiv cs.CL · March 27, 2026


Key Points

  • The paper introduces PICon, a multi-turn interrogation evaluation framework aimed at checking whether LLM-based persona agents maintain consistent, contradiction-free behavior during extended interactions.
  • PICon assesses persona agents across three dimensions: internal consistency, external consistency with real-world facts, and retest consistency to measure stability under repeated questioning.
  • Experiments with seven groups of persona agents and 63 human participants show that even previously claimed “highly consistent” systems often fall below the human baseline on all three consistency dimensions.
  • The authors report that chained, logically connected questions can trigger contradictions and evasive responses that simpler evaluations may miss.
  • The work includes source code and an interactive demo, positioning PICon as a practical methodology for evaluating persona agents prior to using them as stand-ins for human participants.
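The evaluation protocol above can be illustrated with a minimal sketch. The code below is a hypothetical toy implementation, not the authors' actual method: `StubPersona`, its fact table, and the `check` callbacks are all invented here for illustration. It shows how internal consistency might be scored over logically chained question pairs, and retest consistency over repeated asks of the same question.

```python
# Hypothetical persona agent: answers from a fixed fact table
# (a deterministic stand-in for an LLM-based agent).
class StubPersona:
    def __init__(self, facts):
        self.facts = facts

    def ask(self, question):
        # Unknown questions get an evasive fallback answer.
        return self.facts.get(question, "I'd rather not say")


def internal_consistency(agent, chained_questions):
    """Fraction of logically chained question pairs whose answers agree.

    Each item is (q1, q2, check), where check(a1, a2) -> bool encodes the
    logical link between the two answers (e.g. birth year vs. age)."""
    results = [check(agent.ask(q1), agent.ask(q2))
               for q1, q2, check in chained_questions]
    return sum(results) / len(results)


def retest_consistency(agent, questions, repeats=3):
    """Fraction of questions answered identically across repeated asks."""
    stable = sum(
        1 for q in questions
        if len({agent.ask(q) for _ in range(repeats)}) == 1
    )
    return stable / len(questions)


persona = StubPersona({
    "What year were you born?": "1990",
    "How old were you in 2020?": "30",
    "Where do you live?": "Seoul",
})

# A chained pair: the second answer must follow logically from the first.
chain = [
    ("What year were you born?", "How old were you in 2020?",
     lambda born, age: 2020 - int(born) == int(age)),
]

print(internal_consistency(persona, chain))                  # 1.0 for this stub
print(retest_consistency(persona, ["Where do you live?"]))   # 1.0 (deterministic)
```

With a real LLM-backed agent the `ask` calls would be model queries, answers would need normalization before comparison, and a persona whose fabricated facts do not cohere would fail the chained checks, which is the contradiction-exposing effect the paper reports.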

Abstract

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/