Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs

arXiv cs.CL / 4/1/2026


Key Points

  • The paper proposes LiteCoST for long-document question answering, consolidating dispersed evidence into a structured, auditable output such as a table, graph, or set of aligned chunks.
  • It introduces Chain-of-Structured-Thought (CoST), a schema-aware prompting template that guides a stronger LLM to generate both a step-wise reasoning trace and the corresponding structured output, including normalization, alignment, and verification/refinement.
  • LiteCoST uses two-stage fine-tuning of small language models (SLMs) on LLM-generated CoST data: supervised fine-tuning for structural alignment, followed by GRPO with triple rewards covering answer quality, format quality, and process consistency.
  • Experiments report LLM-comparable accuracy on multi-domain long-document QA with 3B/7B SLMs, at 2–4x lower latency than GPT-4o and DeepSeek-R1 (671B).
  • The authors provide code via the referenced GitHub repository to enable reproduction and further experimentation.
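
The CoST template described above guides a teacher LLM through structure induction, normalization, alignment, serialization, and verification. As a minimal sketch of what such a schema-aware instruction might look like, here is a hypothetical prompt builder; the step wording, schema format, and function name are illustrative assumptions, not the authors' actual template:

```python
# Hypothetical sketch of a schema-aware CoST prompt template.
# The stage names follow the pipeline described in the paper; the exact
# wording and schema notation are assumptions.

COST_STEPS = [
    "1. Induce a minimal structure (e.g., a table schema) that covers the question.",
    "2. Normalize entities and units across the document.",
    "3. Align records from dispersed evidence into the structure.",
    "4. Serialize the structured output (e.g., as a table or JSON).",
    "5. Verify the output against the document and refine any inconsistencies.",
]

def build_cost_prompt(document: str, question: str, schema_hint: str) -> str:
    """Assemble a CoST-style instruction for a strong teacher LLM."""
    steps = "\n".join(COST_STEPS)
    return (
        "You are given a long document and a question.\n"
        f"Proposed output schema: {schema_hint}\n\n"
        "Reason step by step, writing an explicit trace for each stage:\n"
        f"{steps}\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}\n"
        "Return the reasoning trace, the structured output, and the final answer."
    )

prompt = build_cost_prompt(
    "<long document text>",
    "Which year had peak revenue?",
    "table(year, revenue)",
)
```

Pairs of (CoST trace, structured output) produced this way would form the auditable supervision used to fine-tune the SLMs.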

Abstract

Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
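
The GRPO stage combines three rewards: answer quality, format quality, and process consistency. A minimal sketch of how such a triple reward could be composed follows; the individual checks, the exact-match metric, and the weights are all illustrative assumptions rather than the paper's actual reward definitions:

```python
import json

# Hypothetical triple-reward combination for a GRPO fine-tuning stage.
# Each component is a simple stand-in; the paper's real rewards may differ.

def answer_reward(pred: str, gold: str) -> float:
    """1.0 for an exact-match answer, else 0.0 (a simple stand-in metric)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def format_reward(structured_output: str) -> float:
    """1.0 if the serialized structure parses as JSON, else 0.0."""
    try:
        json.loads(structured_output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def process_reward(trace: str,
                   required_steps=("normalize", "align", "verify")) -> float:
    """Fraction of required CoST stages mentioned in the reasoning trace."""
    hits = sum(step in trace.lower() for step in required_steps)
    return hits / len(required_steps)

def total_reward(pred: str, gold: str, structured_output: str, trace: str,
                 weights=(1.0, 0.5, 0.5)) -> float:
    """Weighted sum of the three reward components."""
    w_a, w_f, w_p = weights
    return (w_a * answer_reward(pred, gold)
            + w_f * format_reward(structured_output)
            + w_p * process_reward(trace))
```

Under this sketch, a rollout that answers correctly, emits valid JSON, and mentions all three stages in its trace would score 1.0 + 0.5 + 0.5 = 2.0, while a rollout failing all three checks would score 0.0; group-relative advantages would then be computed over such scores within each sampled group.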