Model Capability Assessment and Safeguards for Biological Weaponization

arXiv cs.AI / 4/23/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The arXiv study benchmarks multiple frontier chat models (ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5, and Meta Muse Spark Thinking) on benign STEM prompts to assess baseline “operational intelligence” for misuse risk.
  • On benign quantitative tasks, Gemini and Meta perform very strongly, while ChatGPT is described as less robust due to “text thinning,” and Claude shows fewer details with some apparent false-positive refusals.
  • A second, more adversarial set with subtle harmful intent finds weaknesses, including edge cases suggesting Gemini’s limited contextual awareness and a potential mismatch between capability growth and moderation calibration.
  • The researchers argue that biological misuse could become a more common geopolitical tool, recommending urgent U.S. policy actions and providing guidance for identifying and differentiating high-risk agents across 25 risk categories.
  • Reported examples include escalating harmful pathways such as poison-ivy-to-crowded-transit scenarios and production/extraction workflows enabled under certain access environments (e.g., international-anonymous, logged-out AI mode).

Abstract

AI leaders and safety reports increasingly warn that advances in model reasoning may enable biological misuse, including by low-expertise users, while major labs describe safeguards as expanding but still evolving rather than settled. This study benchmarks ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5 and Meta's Muse Spark Thinking on 73 novice-framed, open-ended benign STEM prompts to measure operational intelligence. On benign quantitative tasks, both Gemini and Meta scored very high; ChatGPT was partially useful but text-thinned, and Claude was sparsest with some apparent false-positive refusals. A second test set detected subtle harmful intent: edge case prompts revealed Gemini's seeming lack of contextual awareness. These results warranted a focused weaponization analysis on Gemini as capability appeared to be outpacing moderation calibration. Gemini was tested across four access environments and reported cases include poison-ivy-to-crowded-transit escalation, poison production and extraction via international-anonymous logged-out AI Mode, and other concerning examples. Biological misuse may become more prevalent as a geopolitical tool, increasing the urgency of U.S. policy responses, especially if model outputs come to be treated as regulated technical data. Guidance is provided for 25 high-risk agents to help distinguish legitimate use cases from higher-risk ones.