Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions
arXiv cs.AI / 5/1/2026
Key Points
- The paper introduces Intent2Tx, a new benchmark for evaluating LLMs that translate natural-language user intents into Ethereum transactions with state-dependent correctness.
- Intent2Tx is built from 300 days of real Ethereum mainnet traces, containing 29,921 single-step and 1,575 multi-step instances across 11 protocol/DeFi categories, avoiding reliance on synthetic instructions.
- The authors propose an execution-aware evaluation framework using differential state analysis on forked mainnet environments to test whether outputs produce the intended on-chain state transitions.
- Experiments across 16 leading LLMs show that scaling and retrieval augmentation improve results, but weaknesses persist in out-of-distribution generalization and multi-step planning.
- The results reveal a key gap: syntactically valid outputs may still fail to produce the desired state changes, underscoring the limits of current “reasoning-to-execution” for Web3 agents.
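The idea behind execution-aware evaluation can be illustrated with a minimal sketch. It is not the authors' framework, whose details are not given in this summary. It assumes that state snapshots are plain dictionaries of token balances, that gas costs are ignored, and that a candidate transaction passes when its state diff equals that of a reference transaction. The names `State`, `state_diff`, and `matches_intent`, and all account and balance values, are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical snapshot type: (account, asset) -> balance.
# A real evaluation would read these values from a forked mainnet node.
State = Dict[Tuple[str, str], int]


def state_diff(before: State, after: State) -> State:
    """Return the net change for every key whose value differs between snapshots."""
    keys = set(before) | set(after)
    return {
        k: after.get(k, 0) - before.get(k, 0)
        for k in keys
        if after.get(k, 0) != before.get(k, 0)
    }


def matches_intent(before: State, reference_after: State, candidate_after: State) -> bool:
    """Assumed pass criterion: the candidate causes the same state changes as the reference."""
    return state_diff(before, reference_after) == state_diff(before, candidate_after)


# Illustrative intent: "swap 1 ETH for USDC" (all numbers are made up).
before = {("alice", "ETH"): 10, ("alice", "USDC"): 0}

# State after the reference (ground-truth) transaction.
reference_after = {("alice", "ETH"): 9, ("alice", "USDC"): 3000}

# Candidate A performs the swap; candidate B is a well-formed transfer
# that spends the ETH but never acquires USDC.
candidate_a_after = {("alice", "ETH"): 9, ("alice", "USDC"): 3000}
candidate_b_after = {("alice", "ETH"): 9, ("alice", "USDC"): 0, ("bob", "ETH"): 1}

result_a = matches_intent(before, reference_after, candidate_a_after)
result_b = matches_intent(before, reference_after, candidate_b_after)
```

Candidate B mirrors the gap described above: the transaction is well formed and executes, yet its state diff does not match the intended one, so a check based only on syntax or successful execution would miss the failure.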