Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

arXiv cs.LG / March 26, 2026


Key Points

  • The paper investigates why large language models often fail to follow formatting instructions while simultaneously performing demanding tasks, framing the failure as a prospective-memory problem from cognitive psychology.
  • Across three model families and 8,000+ prompts, formatting compliance drops by 2–21% under concurrent task load, indicating a measurable interference effect.
  • The vulnerability is strongly constraint-type dependent: terminal (response-boundary) constraints suffer the most, with compliance reductions up to 50%, while avoidance constraints degrade less.
  • Adding salience-enhancing formatting—explicit instruction framing plus a trailing reminder—substantially recovers compliance, bringing many settings back to 90–100%.
  • Interference is bidirectional: formatting constraints can also reduce task accuracy (one model's GSM8K score drops from 93% to 27%), and joint compliance worsens as more constraints are stacked; all results are scored with deterministic programmatic checkers on public datasets.
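The salience-enhanced format described above can be illustrated with a minimal sketch. The function name, the `INSTRUCTION:`/`REMINDER:` framing strings, and the overall layout are assumptions for illustration; the paper's exact prompt template may differ.

```python
def build_salient_prompt(task: str, constraint: str) -> str:
    """Wrap a task with explicit instruction framing and a trailing reminder.

    Illustrative sketch of a salience-enhanced prompt: the formatting
    constraint is stated up front and repeated at the end, so it stays
    salient even after a long intervening task.
    """
    return (
        f"INSTRUCTION: {constraint}\n\n"
        f"{task}\n\n"
        f"REMINDER: {constraint}"
    )

# Example: a terminal constraint on a math task
prompt = build_salient_prompt(
    task="Solve: if x + 3 = 7, what is x?",
    constraint="End your answer with the word DONE.",
)
```

The trailing reminder is the key element: it places the constraint at the point in the context window closest to generation, which is what the paper's salience manipulation targets.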

Abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective-memory-inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2–21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90–100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers, with no LLM-as-judge component, on publicly available datasets.
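The deterministic checkers the abstract mentions can be sketched as simple string predicates. The two functions below are hypothetical examples of the two constraint types the paper contrasts (terminal vs. avoidance); the actual constraint set and checker implementations in the paper are not specified here.

```python
def check_terminal_constraint(response: str, marker: str = "DONE") -> bool:
    """Terminal constraint: the response must end with a required marker.

    This is the type the paper finds most fragile, since the required
    action sits at the response boundary, after the main task is done.
    """
    return response.rstrip().endswith(marker)


def check_avoidance_constraint(response: str, banned: str = "basically") -> bool:
    """Avoidance constraint: a banned word must not appear anywhere.

    Comparatively robust in the paper, since the model only has to
    suppress a behaviour rather than remember a deferred action.
    """
    return banned.lower() not in response.lower()
```

Because both checks are pure string functions, compliance scoring is fully reproducible, which is what makes an LLM-as-judge unnecessary.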