Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities

arXiv cs.AI / 4/15/2026


Key Points

  • The study evaluates whether AI tools can “upgrade” low-cognitive-demand math tasks into higher-quality tasks, rather than only judging task quality.
  • Eleven AI tools were tested under the Task Analysis Guide framework (Stein & Smith, 1998), prompted with strategies modeled on typical teacher approaches; overall success was only moderate, with accurate upgrades 64% of the time.
  • Performance varied widely across tools, from quite weak (33%) to broadly successful (88%), indicating uneven capability in task modification.
  • Specialized math-teacher tools were only moderately better than general-purpose tools, suggesting domain specialization alone does not guarantee reliable curriculum adaptation.
  • Common failure modes included “undershooting” (tasks stayed low-demand) and “overshooting” (tasks became too ambitious and would likely be rejected by teachers), and the ability to upgrade tasks was negatively correlated with the ability to classify cognitive demand (r = -0.35; a minimal sketch of this computation follows the list).
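
For intuition, here is a minimal sketch of the correlation statistic reported above, using Python's standard library. The eleven accuracy pairs are invented placeholders for illustration, not the study's data.

```python
# Pearson correlation between per-tool classification accuracy and
# upgrade accuracy. The values below are invented placeholders, NOT the
# study's data; they merely illustrate how a negative r arises when
# better classifiers are not better upgraders.
from statistics import correlation  # Python 3.10+

classify_acc = [0.90, 0.55, 0.70, 0.80, 0.60, 0.75, 0.85, 0.50, 0.65, 0.95, 0.70]
upgrade_acc  = [0.33, 0.80, 0.64, 0.50, 0.88, 0.60, 0.45, 0.85, 0.70, 0.40, 0.55]

print(f"r = {correlation(classify_acc, upgrade_acc):.2f}")
```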

Abstract

While recent research has explored AI tools' ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested: six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, coteach.ai). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted the AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy was designed to represent the likely approach of a knowledgeable teacher, rather than extensive optimization to find a more effective prompt (i.e., an optimistic but typical outcome). On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with individual tools ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both "undershooting" (maintaining low cognitive demand) and "overshooting" (elevating tasks to an overly ambitious target category that teachers would likely reject). Interestingly, there was a small negative correlation (r = -0.35) between whether a given AI tool could correctly classify the cognitive demand of tasks and whether it could upgrade them, indicating that modifying tasks (a generative task) is a capability distinct from classifying them (judgment against a rubric). These findings have important implications for understanding AI's potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.
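
The abstract does not specify how upgrade outcomes were scored; the sketch below is an assumption-laden illustration, treating the Task Analysis Guide's four demand levels as an ordinal scale and assuming each trial had an intended target level. The function name and outcome labels are hypothetical, not the authors' code.

```python
# Hypothetical scoring sketch (not the authors' published procedure).
# Demand levels follow the Task Analysis Guide (Stein & Smith, 1998),
# ordered from lowest to highest cognitive demand and treated here,
# as a simplifying assumption, as an ordinal scale.
TAG_LEVELS = [
    "memorization",
    "procedures without connections",
    "procedures with connections",
    "doing mathematics",
]

def score_upgrade(modified_level: str, target_level: str) -> str:
    """Score a modified task against the intended target demand level."""
    mod = TAG_LEVELS.index(modified_level)
    tgt = TAG_LEVELS.index(target_level)
    if mod < tgt:
        return "undershoot"   # task still below the intended demand
    if mod > tgt:
        return "overshoot"    # task elevated past the intended target
    return "accurate upgrade"

# Example: aiming a memorization task at "procedures with connections"
print(score_upgrade("doing mathematics", "procedures with connections"))
# -> overshoot
```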