Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

arXiv cs.LG / March 26, 2026


Key Points

  • The paper investigates why large language models often fail to follow formatting instructions while simultaneously performing demanding tasks, framing the failure as a prospective-memory problem from cognitive psychology.
  • Across three model families and 8,000+ prompts, formatting compliance drops by 2–21% under concurrent task load, indicating a measurable interference effect.
  • The vulnerability is strongly constraint-type dependent: terminal (response-boundary) constraints suffer the most, with compliance reductions up to 50%, while avoidance constraints degrade less.
  • Adding salience-enhancing formatting—explicit instruction framing plus a trailing reminder—substantially recovers compliance, bringing many settings back to 90–100%.
  • Interference is bidirectional: formatting constraints can also reduce task accuracy (one model's GSM8K score drops from 93% to 27%), and joint compliance worsens as more constraints are stacked; all results are scored with deterministic programmatic checkers on public datasets.
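The salience-enhanced format described above can be illustrated with a minimal sketch. The function name, the `INSTRUCTION:`/`REMINDER:` framing strings, and the overall layout are assumptions for illustration; the paper's exact prompt template may differ.

```python
def build_salient_prompt(task: str, constraint: str) -> str:
    """Wrap a task with explicit instruction framing and a trailing reminder.

    Illustrative sketch of a salience-enhanced prompt: the formatting
    constraint is stated up front and repeated at the end, so it stays
    salient even after a long intervening task.
    """
    return (
        f"INSTRUCTION: {constraint}\n\n"
        f"{task}\n\n"
        f"REMINDER: {constraint}"
    )

# Example: a terminal constraint on a math task
prompt = build_salient_prompt(
    task="Solve: if x + 3 = 7, what is x?",
    constraint="End your answer with the word DONE.",
)
```

The trailing reminder is the key element: it places the constraint at the point in the context window closest to generation, which is what the paper's salience manipulation targets.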

Abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective-memory-inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2–21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90–100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers, with no LLM-as-judge component, on publicly available datasets.
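The deterministic checkers the abstract mentions can be sketched as simple string predicates. The two functions below are hypothetical examples of the two constraint types the paper contrasts (terminal vs. avoidance); the actual constraint set and checker implementations in the paper are not specified here.

```python
def check_terminal_constraint(response: str, marker: str = "DONE") -> bool:
    """Terminal constraint: the response must end with a required marker.

    This is the type the paper finds most fragile, since the required
    action sits at the response boundary, after the main task is done.
    """
    return response.rstrip().endswith(marker)


def check_avoidance_constraint(response: str, banned: str = "basically") -> bool:
    """Avoidance constraint: a banned word must not appear anywhere.

    Comparatively robust in the paper, since the model only has to
    suppress a behaviour rather than remember a deferred action.
    """
    return banned.lower() not in response.lower()
```

Because both checks are pure string functions, compliance scoring is fully reproducible, which is what makes an LLM-as-judge unnecessary.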