Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

arXiv cs.CL / March 23, 2026


Key Points

  • The paper explores using large language models to iteratively generate Python policy functions for agents in sequential social dilemmas and evaluate them via self-play with performance feedback.
  • It compares sparse feedback (scalar reward) to dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace) across two canonical dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6 and Gemini 3.1 Pro), with dense feedback consistently matching or exceeding sparse feedback on all metrics.
  • Dense social metrics act as coordination signals that guide the LLM toward cooperative strategies such as territory partitioning, adaptive role assignment, and avoidance of wasteful aggression, without triggering over-optimization of fairness.
  • The authors perform an adversarial experiment identifying five attack classes and discuss mitigations, highlighting a tension between expressiveness and safety in LLM policy synthesis.
  • The work provides code at https://github.com/vicgalle/llm-policies-social-dilemmas, enabling replication and further study.
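The generate–evaluate–refine loop in the first key point can be sketched as a simple Python program. All names here (`refine_policy`, `toy_generate`, `toy_evaluate`) are illustrative stand-ins, not the paper's actual API; a real run would replace the toy functions with an LLM call and a self-play rollout in the environment.

```python
# Hypothetical sketch of LLM policy synthesis with iterative refinement.
# generate(history) -> policy source code; evaluate(src) -> feedback dict.
def refine_policy(generate, evaluate, n_iters=5):
    history = []                         # (source, feedback) pairs shown to the LLM
    best_src, best_reward = None, float("-inf")
    for _ in range(n_iters):
        src = generate(history)          # LLM proposes a Python policy function
        feedback = evaluate(src)         # self-play rollout scores the policy
        history.append((src, feedback))  # fed back into the next prompt
        if feedback["reward"] > best_reward:
            best_src, best_reward = src, feedback["reward"]
    return best_src, best_reward

# Toy stand-ins so the loop runs without an actual LLM:
def toy_generate(history):
    # Pretend the LLM "improves" by one unit per iteration.
    return f"def policy(obs): return {len(history)}"

def toy_evaluate(src):
    # Pretend later policies earn proportionally more reward.
    return {"reward": float(src.rsplit(" ", 1)[-1])}

src, reward = refine_policy(toy_generate, toy_evaluate, n_iters=3)
```

The key design point is that `history` carries the feedback forward: under sparse feedback each entry holds only a scalar reward, while under dense feedback it would also include the social metrics the paper studies.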

Abstract

We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement), comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward-hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code is available at https://github.com/vicgalle/llm-policies-social-dilemmas.
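The abstract does not give formulas for the four social metrics, but a plausible set of definitions, following common usage in the sequential-social-dilemma literature, can be sketched below. The function name, argument names, and exact formulas are assumptions: efficiency as mean per-agent return, equality as one minus the Gini coefficient of returns, and peace as the fraction of timesteps without an aggressive "tag" action (sustainability, which needs per-step consumption times, is omitted for brevity).

```python
# Illustrative social-metric definitions for a dense-feedback prompt.
# These are assumed formulas, not necessarily the paper's exact ones.
def social_metrics(rewards, tag_steps, total_steps):
    """rewards: per-agent episode returns; tag_steps: steps containing an
    aggressive 'tag' action; total_steps: episode length."""
    n = len(rewards)
    efficiency = sum(rewards) / n  # mean per-agent return
    # Equality as 1 - Gini coefficient of the return distribution.
    denom = 2 * n * sum(rewards)
    gini = sum(abs(a - b) for a in rewards for b in rewards) / denom if denom else 0.0
    equality = 1.0 - gini
    # Peace as the fraction of steps free of aggression.
    peace = 1.0 - tag_steps / total_steps
    return {"efficiency": efficiency, "equality": equality, "peace": peace}
```

For example, two agents with equal returns and no tagging score `equality == 1.0` and `peace == 1.0`, while a lopsided split like `[0, 10]` drops equality to 0.5; reporting these values alongside raw reward is what distinguishes the dense-feedback condition from the sparse one.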