
Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs

arXiv cs.AI / 3/23/2026


Key Points

  • The paper identifies the challenge of maintaining faithful, consistent persona characterization in long open-ended dialogues and proposes Memory-Driven Role-Playing (MRP) where persona knowledge is treated as an internal memory store retrieved from dialogue context.
  • It introduces MREval, MRPrompt, and MRBench (a bilingual Chinese/English benchmark) to diagnose and enhance four memory-driven abilities: Anchoring, Recalling, Bounding, and Enacting.
  • Experimental results show that MRPrompt enables small models (e.g., Qwen3-8B) to match the performance of larger closed-source LLMs (e.g., Qwen3-Max, GLM-4.7), demonstrating that memory-focused prompting can boost efficiency.
  • The work highlights that upstream memory gains improve downstream response quality and provides a comprehensive diagnostic suite across 12 LLMs.

Abstract

A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski's "emotional memory" acting theory, this paradigm frames persona knowledge as the LLM's internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of the depth and autonomous use of that knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. The paradigm yields a comprehensive diagnostic of these four staged role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirms that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
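To make the two-stage idea behind MRPrompt concrete, here is a minimal sketch of a prompting pipeline that first asks the model to recall relevant persona "memories" from the dialogue context, then generates an in-character reply grounded in those memories. The prompt wording, function names, and stage structure are illustrative assumptions for this summary, not the authors' actual templates.

```python
# Hypothetical sketch of a memory-driven, two-stage prompting pipeline.
# Stage 1 (Recalling): surface the persona facts evoked by the latest turn.
# Stage 2 (Enacting):  reply in character, bounded by those recalled facts.
# All prompt text below is an illustrative assumption, not the paper's template.

def build_recall_prompt(persona: str, dialogue: list[str]) -> str:
    """Stage 1: ask the model which persona facts the dialogue evokes."""
    history = "\n".join(dialogue)
    return (
        f"You are role-playing the following persona:\n{persona}\n\n"
        f"Dialogue so far:\n{history}\n\n"
        "List the persona facts (memories) most relevant to the last turn. "
        "Include only facts this persona could plausibly know."
    )

def build_enact_prompt(persona: str, dialogue: list[str], memories: str) -> str:
    """Stage 2: generate a reply grounded in the recalled memories."""
    history = "\n".join(dialogue)
    return (
        f"You are role-playing the following persona:\n{persona}\n\n"
        f"Relevant memories:\n{memories}\n\n"
        f"Dialogue so far:\n{history}\n\n"
        "Reply in character, using only the memories above; do not invent "
        "knowledge outside the persona's bounds."
    )

# Usage with some chat-completion function `llm(prompt) -> str` (hypothetical):
#   memories = llm(build_recall_prompt(persona, dialogue))
#   reply    = llm(build_enact_prompt(persona, dialogue, memories))
```

Splitting retrieval from generation mirrors the paper's claim that upstream memory gains carry over: if stage 1 recalls the right facts, stage 2 has less room to drift out of character.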