Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench

arXiv cs.CL / 4/17/2026


Key Points

  • Pangu-ACE is an educational response generation system that dynamically spends more compute only when needed, using a sample-level cascade from a 1B “tutor-router” to a 7B specialist prompt.
  • The pipeline generates a draft answer and routing signals with the 1B model, then either accepts the draft or escalates each sample to the 7B expert based on task-dependent routing decisions.
  • The paper fixes a major offline evaluation bug that previously over-credited open-form outputs that merely passed superficial formatting checks, and reports improved metrics on EduBench’s full Chinese test archive (7,013 samples).
  • Results show deterministic quality increases from 0.457 to 0.538 and format validity from 0.707 to 0.866 versus the legacy rule_v2 system, with 19.7% of requests handled directly by the 1B model.
  • Although the archived deployment does not yet show latency gains, the efficiency claim rests on routing selectivity rather than wall-clock speedup; GPT-5.4 baseline re-judging remains pending because the configured provider endpoint and key are invalid.
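The sample-level routing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model stubs, the routing-signal fields (`confidence`), the per-task thresholds, and the always-escalate task set are all assumptions chosen to mirror the reported behavior (IP mostly accepted at 1B; QG and EC almost always escalated).

```python
# Hypothetical sketch of a sample-level 1B -> 7B cascade.
# Model calls, thresholds, and signal names are illustrative assumptions.

ESCALATE_TASKS = {"QG", "EC"}                    # tasks that almost always escalate
ACCEPT_THRESHOLD = {"IP": 0.4, "default": 0.8}   # assumed per-task confidence bars

def draft_with_signals(sample):
    """Stand-in for the 1B tutor-router: returns a draft plus routing signals."""
    return {"draft": f"draft for {sample['id']}", "confidence": sample.get("conf", 0.5)}

def expert_answer(sample):
    """Stand-in for the 7B specialist prompt."""
    return f"expert answer for {sample['id']}"

def cascade(sample):
    out = draft_with_signals(sample)
    task = sample["task"]
    bar = ACCEPT_THRESHOLD.get(task, ACCEPT_THRESHOLD["default"])
    # Escalate when the task is known-hard or the 1B draft looks unreliable.
    if task in ESCALATE_TASKS or out["confidence"] < bar:
        return {"answer": expert_answer(sample), "route": "7B"}
    return {"answer": out["draft"], "route": "1B"}

print(cascade({"id": "s1", "task": "IP", "conf": 0.9}))  # accepted at 1B
print(cascade({"id": "s2", "task": "QG", "conf": 0.9}))  # escalated to 7B
```

Because routing is decided per sample rather than per deployment, the compute saving is exactly the fraction of traffic the 1B model keeps, which is why the paper frames efficiency as routing selectivity.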

Abstract

Educational assistants should spend more computation only when the task needs it. This paper rewrites our earlier draft around the system that was actually implemented and archived in the repository: a sample-level 1B to 7B cascade for the shared-8 EduBench benchmark. The final system, Pangu-ACE, uses a 1B tutor-router to produce a draft answer plus routing signals, then either accepts the draft or escalates the sample to a 7B specialist prompt. We also correct a major offline evaluation bug: earlier summaries over-credited some open-form outputs that only satisfied superficial format checks. After CPU-side rescoring from saved prediction JSONL, the full Chinese test archive (7,013 samples) shows that cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system while accepting 19.7% of requests directly at 1B. Routing is strongly task dependent: IP is accepted by 1B 78.0% of the time, while QG and EC still escalate almost always. The current archived deployment does not yet show latency gains, so the defensible efficiency story is routing selectivity rather than wall-clock speedup. We also package a reproducible artifact-first paper workflow and clarify the remaining external-baseline gap: GPT-5.4 re-judging is implemented locally, but the configured provider endpoint and key are invalid, so final sampled-baseline alignment with GPT-5.4 remains pending infrastructure repair.
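The CPU-side rescoring step the abstract describes amounts to replaying saved predictions through stricter checks and re-aggregating metrics, with no GPU inference. A minimal sketch, assuming hypothetical JSONL field names (`route`, `format_ok`, `quality`) rather than the repository's actual schema:

```python
# Illustrative offline rescoring from saved prediction JSONL.
# Field names are assumptions, not the paper's actual record format.
import json

def rescore(jsonl_lines):
    """Aggregate quality, strict format validity, and 1B acceptance rate."""
    n = q = fmt = at_1b = 0
    for line in jsonl_lines:
        rec = json.loads(line)
        n += 1
        q += rec["quality"]             # deterministic quality score in [0, 1]
        fmt += rec["format_ok"]         # strict format check, not a superficial one
        at_1b += rec["route"] == "1B"   # routing selectivity: accepted at 1B
    return {"quality": q / n, "format_validity": fmt / n, "accept_1b": at_1b / n}

sample = [
    json.dumps({"task": "IP", "route": "1B", "format_ok": 1, "quality": 0.6}),
    json.dumps({"task": "QG", "route": "7B", "format_ok": 1, "quality": 0.5}),
    json.dumps({"task": "EC", "route": "7B", "format_ok": 0, "quality": 0.4}),
]
print(rescore(sample))
```

Because the rescoring reads only archived JSONL, fixing the over-crediting bug (here, tightening the `format_ok` check) changes the reported numbers without rerunning any model, which is what makes the corrected EduBench figures reproducible from the archive alone.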