PRISM: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

arXiv cs.AI / 4/6/2026

Key Points

  • PRISM is an RL framework that creates a discrete set of causally validated “concepts” by clustering encoder features, and then uses these concepts as an interpretable transfer interface between agents trained with different algorithms.
  • The authors use causal interventions to show that forcing or overriding concept assignments changes chosen actions in 69.4% of tested cases (p=8.6×10^-86), supporting the claim that concepts drive behavior rather than just correlate with it.
  • Concept roles are shown to be uneven: the most frequently used concept causes only a small win-rate drop when ablated, while a less frequent concept can substantially collapse performance, indicating strategy-critical but low-usage concepts.
  • By aligning concepts across agents via optimal bipartite matching, PRISM enables zero-shot strategy transfer: on Go 7×7, the successful transfer pairs reach ~69.5%±3.2% and 76.4%±3.4% win rates against a standard engine, far above random and misaligned baselines.
  • The approach appears to depend on domains where strategic state is naturally discrete: on Atari Breakout, the same pipeline yields bottleneck policies around random-agent performance, suggesting structural limits on when transfer will work.
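The alignment step in the fourth point can be made concrete. The paper matches each source agent's concepts to a target agent's concepts via optimal bipartite matching; a natural cost is the distance between concept centroids in feature space. The sketch below brute-forces the matching over permutations, which is only feasible for small K (a real pipeline with K≈64 concepts would use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`). The centroid values are invented toy data, not from the paper.

```python
import itertools

def align_concepts(src_centroids, tgt_centroids):
    """Match each source concept to a target concept so that the total
    squared centroid distance is minimized (optimal bipartite matching,
    brute-forced over permutations; fine only for small K)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    k = len(src_centroids)
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(k)):
        cost = sum(dist2(src_centroids[i], tgt_centroids[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    # best_perm[i] is the target concept matched to source concept i
    return best_perm

# Toy example: 3 concepts in a 2-D feature space. The target agent has
# learned the same three concepts, but numbered in a different order.
src = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
tgt = [(5.1, 4.9), (0.1, 5.2), (0.2, -0.1)]
print(align_concepts(src, tgt))  # → (2, 0, 1)
```

Once this mapping is known, a target agent can interpret "the source is in concept i" as "I am in my concept best_perm[i]", which is what makes zero-shot transfer of strategy labels possible.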

Abstract

We present PRISM (Policy Reuse via Interpretable Strategy Mapping), a framework that grounds reinforcement learning agents' decisions in discrete, causally validated concepts and uses those concepts as a zero-shot transfer interface between agents trained with different algorithms. PRISM clusters each agent's encoder features into K concepts via K-means. Causal intervention establishes that these concepts directly drive, not merely correlate with, agent behavior: overriding concept assignments changes the selected action in 69.4% of interventions (p = 8.6×10^-86, 2500 interventions). Concept importance and usage frequency are dissociated: the most-used concept (C47, 33.0% frequency) causes only a 9.4% win-rate drop when ablated, while ablating C16 (15.4% frequency) collapses win rate from 100% to 51.8%. Because concepts causally encode strategy, aligning them via optimal bipartite matching transfers strategic knowledge zero-shot. On Go 7×7 with three independently trained agents, concept transfer achieves 69.5%±3.2% and 76.4%±3.4% win rate against a standard engine across the two successful transfer pairs (10 seeds), compared to 3.5% for a random agent and 9.2% without alignment. Transfer succeeds when the source policy is strong; geometric alignment quality predicts nothing (R^2 ≈ 0). The framework is scoped to domains where strategic state is naturally discrete: the identical pipeline on Atari Breakout yields bottleneck policies at random-agent performance, confirming that the Go results reflect a structural property of the domain.
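The abstract's causal-intervention claim can be illustrated with a toy harness: assign each state to its nearest concept centroid (the K-means bottleneck), record the policy's action, force a different concept assignment, and count how often the action changes. Everything below (the nearest-centroid assignment, the toy policy, the centroids) is a hypothetical sketch of the measurement, not the paper's implementation; the paper reports a 69.4% change rate over 2500 interventions.

```python
import random

def concept_of(features, centroids):
    # Nearest-centroid assignment: the discrete K-means bottleneck.
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(features, centroids[k])))

def policy(features, concept):
    # Toy policy over 4 actions: mostly concept-driven, with a small
    # feature-dependent component, so interventions change the action
    # often but not always.
    return (concept * 2 + (1 if features[0] > 0.5 else 0)) % 4

def intervention_change_rate(states, centroids, seed=0):
    """Fraction of states whose action changes when the concept
    assignment is overridden with a random different concept."""
    rng = random.Random(seed)
    changed = 0
    for s in states:
        c = concept_of(s, centroids)
        baseline = policy(s, c)
        forced = rng.choice([k for k in range(len(centroids)) if k != c])
        if policy(s, forced) != baseline:
            changed += 1
    return changed / len(states)

centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rng = random.Random(42)
states = [(rng.random(), rng.random()) for _ in range(500)]
rate = intervention_change_rate(states, centroids)
print(f"action changed in {rate:.1%} of interventions")
```

A rate well above 0% supports "concepts drive behavior"; a rate near 0% would mean the policy ignores the bottleneck and concepts are merely correlational.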
