COMPOSITE-Stem

arXiv cs.AI / 4/14/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

物理・生物・化学・数学の70タスクからなる新しい科学向けベンチマーク「COMPOSITE-STEM」を、博士レベル研究者が専門家執筆で構築したと発表している。
厳密一致の採点に加え、条件ベースのルーブリックと「LLM-as-a-jury」による採点プロトコルを組み合わせ、従来より柔軟に“科学的に意味のある出力”を評価できるようにしている。
Harborのエージェント評価フレームワーク上で改変したマルチモーダルTerminus-2エージェントを用いて4つのフロンティアモデルを評価し、最上位モデルは21%を達成したと報告している。
すべてのタスクをオープンソース化し、再現性と研究促進（科学発見の加速に向けた追加検証）を目的とした寄与許可の枠組みを示している。

Abstract

AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI's acceleration of scientific progress in these domains.

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 4/14DailyView insight →

Black Hat Asia

AI Business

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

Don't forget, there is more than forgetting: new metrics for Continual Learning

Dev.to

Microsoft MAI-Image-2-Efficient Review 2026: The AI Image Model Built for Production Scale

Dev.to

Bit of a strange question?

Reddit r/artificial

COMPOSITE-Stem

Key Points

Abstract

💡 Insights using this article

Related Articles

Black Hat Asia

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Don't forget, there is more than forgetting: new metrics for Continual Learning

Microsoft MAI-Image-2-Efficient Review 2026: The AI Image Model Built for Production Scale

Bit of a strange question?

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer