
Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs

arXiv cs.AI / 3/23/2026


Key Points

  • The paper identifies the challenge of maintaining faithful, consistent persona characterization in long open-ended dialogues and proposes Memory-Driven Role-Playing (MRP) where persona knowledge is treated as an internal memory store retrieved from dialogue context.
  • It introduces MREval, MRPrompt, and MRBench (a bilingual Chinese/English benchmark) to diagnose and enhance four memory-driven abilities: Anchoring, Recalling, Bounding, and Enacting.
  • Experimental results show that MRPrompt enables small models (e.g., Qwen3-8B) to match the performance of larger closed-source LLMs (e.g., Qwen3-Max, GLM-4.7), demonstrating that memory-focused prompting can boost efficiency.
  • The work highlights that upstream memory gains improve downstream response quality and provides a comprehensive diagnostic suite across 12 LLMs.

Abstract

A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski's "emotional memory" acting theory, this paradigm frames persona knowledge as the LLM's internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of the depth and autonomous use of that knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. The paradigm yields a comprehensive diagnostic of these four staged role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirms that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
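To make the two-stage idea behind MRPrompt concrete, here is a minimal sketch of a prompting pipeline that first asks the model to recall relevant persona "memories" from the dialogue context, then generates an in-character reply grounded in those memories. The prompt wording, function names, and stage structure are illustrative assumptions for this summary, not the authors' actual templates.

```python
# Hypothetical sketch of a memory-driven, two-stage prompting pipeline.
# Stage 1 (Recalling): surface the persona facts evoked by the latest turn.
# Stage 2 (Enacting):  reply in character, bounded by those recalled facts.
# All prompt text below is an illustrative assumption, not the paper's template.

def build_recall_prompt(persona: str, dialogue: list[str]) -> str:
    """Stage 1: ask the model which persona facts the dialogue evokes."""
    history = "\n".join(dialogue)
    return (
        f"You are role-playing the following persona:\n{persona}\n\n"
        f"Dialogue so far:\n{history}\n\n"
        "List the persona facts (memories) most relevant to the last turn. "
        "Include only facts this persona could plausibly know."
    )

def build_enact_prompt(persona: str, dialogue: list[str], memories: str) -> str:
    """Stage 2: generate a reply grounded in the recalled memories."""
    history = "\n".join(dialogue)
    return (
        f"You are role-playing the following persona:\n{persona}\n\n"
        f"Relevant memories:\n{memories}\n\n"
        f"Dialogue so far:\n{history}\n\n"
        "Reply in character, using only the memories above; do not invent "
        "knowledge outside the persona's bounds."
    )

# Usage with some chat-completion function `llm(prompt) -> str` (hypothetical):
#   memories = llm(build_recall_prompt(persona, dialogue))
#   reply    = llm(build_enact_prompt(persona, dialogue, memories))
```

Splitting retrieval from generation mirrors the paper's claim that upstream memory gains carry over: if stage 1 recalls the right facts, stage 2 has less room to drift out of character.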