Semantic Invariance in Agentic AI
arXiv cs.AI · March 16, 2026
Key Points
- The paper presents a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents under semantic variations.
- It defines eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) and tests across seven foundation models spanning four architectures (Hermes, Qwen3, DeepSeek-R1, and gpt-oss).
- It evaluates 19 multi-step reasoning problems across eight scientific domains, finding that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91).
- The results suggest robustness cannot be inferred from size alone, highlighting the need for metamorphic test benchmarks in evaluating LLM agents.
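The core metamorphic-testing loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_model` is a hypothetical placeholder for an LLM agent, `similarity` uses a string-ratio proxy rather than the embedding-based semantic similarity the paper reports, and only three of the eight transformations are shown.

```python
from difflib import SequenceMatcher

def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM agent: answers a fixed fact
    # regardless of phrasing, so transformed prompts yield stable output.
    return "Water boils at 100 C at sea level."

def similarity(a: str, b: str) -> float:
    # Crude proxy for semantic similarity (the paper uses an
    # embedding-based measure, reporting scores such as 0.91).
    return SequenceMatcher(None, a, b).ratio()

# A subset of the paper's eight semantic-preserving transformations;
# the lambdas here are illustrative rewrites, not the paper's exact ones.
TRANSFORMS = {
    "identity": lambda p: p,
    "paraphrase": lambda p: "Put differently: " + p,
    "expansion": lambda p: p + " Please explain your reasoning step by step.",
}

def metamorphic_invariance(prompt: str, threshold: float = 0.8) -> dict:
    """Run the prompt through each transformation and flag whether the
    model's answer stays close to the identity-case baseline."""
    baseline = toy_model(TRANSFORMS["identity"](prompt))
    return {
        name: similarity(baseline, toy_model(t(prompt))) >= threshold
        for name, t in TRANSFORMS.items()
    }

report = metamorphic_invariance("At what temperature does water boil at sea level?")
invariant_rate = sum(report.values()) / len(report)
print(report, invariant_rate)
```

With a real model in place of `toy_model`, `invariant_rate` corresponds to the fraction of invariant responses the paper reports per model (e.g. 79.6% for Qwen3-30B-A3B).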