ArguAgent: AI-Supported Real-Time Grouping for Productive Argumentation in STEM Classrooms

arXiv cs.AI / 4/28/2026

Key Points

  • ArguAgent is a generative AI system designed to form student groups in real time for more productive, inclusive argumentation in STEM classrooms by balancing stance heterogeneity while tightly limiting differences in argument quality.
  • The system uses a two-stage pipeline: it scores student arguments on a 0–4 rubric, then clusters students’ positions using semantic analysis.
  • The argument-scoring component was validated against human expert consensus with Krippendorff’s α of 0.817 using 200 expert-generated scores.
  • Experiments with multiple OpenAI models (GPT-4o-mini, GPT-5.1, GPT-5.2) show that prompt engineering based on human disagreement analysis drove most of the scoring improvement (89%), with model upgrades contributing the remaining 11%.
  • In simulations across 100 classes, ArguAgent’s grouping met both design constraints in 95.4% of cases—about a 3.2× improvement over random assignment—suggesting it can support theoretically grounded real-time grouping.
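The two design constraints above (stance heterogeneity plus rubric-quality differences within ±1 level) can be sketched as a minimal greedy heuristic. This is an illustrative toy, not the paper's algorithm: student records, the bucketing strategy, and the group size are all assumptions of this sketch.

```python
from collections import defaultdict

def quality_ok(group, max_gap=1):
    """Quality constraint: rubric levels (0-4) within a group differ by at most max_gap."""
    levels = [quality for _, quality in group]
    return max(levels) - min(levels) <= max_gap

def stance_heterogeneous(group):
    """Stance constraint: the group contains at least two distinct stances."""
    return len({stance for stance, _ in group}) >= 2

def form_groups(students, group_size=4):
    """Greedy sketch: bucket students by rubric level, round-robin over
    stances inside each level so consecutive students mix positions, then
    slice the resulting order into groups. Groups that straddle a level
    boundary can still violate the quality constraint if a level is sparse;
    a real system would repair those with local swaps.
    `students` is a list of (stance, quality) tuples -- a hypothetical format."""
    buckets = defaultdict(lambda: defaultdict(list))
    for stance, quality in students:
        buckets[quality][stance].append((stance, quality))
    ordered = []
    for level in sorted(buckets):
        stance_lists = list(buckets[level].values())
        # Interleave stances within this quality level.
        while any(stance_lists):
            for lst in stance_lists:
                if lst:
                    ordered.append(lst.pop())
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
```

For a balanced toy class (two stances, two adjacent quality levels), every group produced this way satisfies both constraints; the paper's 95.4% figure reflects harder, simulated class compositions.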

Abstract

Argumentation is a core practice in STEM education, but its productivity depends on who participates and how they interact. Higher-achieving students often dominate the talk and decision-making, while lower-achieving peers may disengage, defer, or comply without contributing substantive reasoning. Forming groups strategically based on students' stances and argumentation skills could help foster inclusive, evidence-based discourse. In practice, however, teachers are constrained in implementing this grouping strategy because it requires real-time insight into students' positions and the quality of their argumentation, information that is difficult to assess reliably and at scale during instruction. We present a generative AI-powered system, ArguAgent, that creates groups optimizing for stance heterogeneity while constraining argumentation quality differences to ±1 level on a validated learning progression. ArguAgent uses a two-component assessment pipeline: first scoring student arguments on a 0-4 rubric, then clustering positions via semantic analysis. We validated the scoring component against human expert consensus (Krippendorff's α = 0.817) using 200 expert-generated scores. Testing three OpenAI models (GPT-4o-mini, GPT-5.1, GPT-5.2) with identical calibrated prompts, we found that systematic prompt engineering informed by human disagreement analysis contributed 89% of scoring improvement (QWK: 0.531 to 0.686), while model upgrades contributed an additional 11% (QWK: 0.686 to 0.708). Simulation testing across 100 classes demonstrated that the grouping algorithm achieves 95.4% of groups that meet both design criteria, a 3.2x improvement over random assignment. These results suggest ArguAgent can enable real-time, theoretically grounded grouping that promotes productive STEM argumentation in classrooms.
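The abstract reports scoring quality as quadratic weighted kappa (QWK), the standard chance-corrected agreement statistic for ordinal scores such as a 0-4 rubric. The paper's exact computation is not shown, but the conventional definition, with penalty weights (i − j)² / (N − 1)², can be implemented directly:

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels=5):
    """Quadratic weighted kappa for ordinal scores on a 0..num_levels-1 scale:
    1 minus the ratio of weighted observed disagreement to weighted
    chance disagreement, with weights (i - j)^2 / (num_levels - 1)^2."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Marginal score distributions, used to build the chance-expected matrix.
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            num += w * observed[i][j]
            den += w * marg_a[i] * marg_b[j] / n  # expected count under independence
    return 1.0 - num / den
```

Perfect agreement yields 1.0 and complete reversal of a uniform score set yields −1.0; a rise from 0.531 to 0.708, as reported, means the AI scorer's weighted disagreement with experts dropped substantially relative to chance.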