Predict, Don't React: Value-Based Safety Forecasting for LLM Streaming

arXiv cs.CL / 4/7/2026


Key Points

  • The paper presents StreamGuard, a model-agnostic streaming guardrail for LLM safety that reframes streaming moderation as a forecasting problem over partial output prefixes rather than earliest-unsafe boundary detection.
  • StreamGuard predicts the expected harmfulness of likely future continuations and uses Monte Carlo rollouts for supervision, enabling early safety intervention without needing exact token-level boundary annotations.
  • Evaluation on safety benchmarks shows improved moderation performance at the 8B scale, including increases in both input-moderation and streaming output-moderation F1 versus a prior strict baseline.
  • On the QWENGUARDTEST streaming benchmark, StreamGuard achieves higher F1 and recall with better on-time intervention and a lower miss rate than the compared streaming guardrail.
  • The approach demonstrates effective transfer across tokenizers and model families, suggesting forecasting-based supervision can support low-latency end-to-end streaming moderation even at smaller scales and with transferred targets.
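The core supervision idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `sample_continuation` and `harm_score` stand in for a generator LLM and a full-response safety classifier, and the rollout count is illustrative.

```python
import random

def sample_continuation(prefix, rng):
    # Stand-in for sampling one full continuation of `prefix`
    # from a generator LLM (hypothetical).
    return prefix + " ...sampled continuation..."

def harm_score(text):
    # Stand-in for a full-text safety classifier returning a
    # harmfulness score in [0, 1] (hypothetical).
    return 1.0 if "unsafe" in text else 0.0

def rollout_target(prefix, n_rollouts=8, seed=0):
    """Estimate the expected harmfulness of likely future continuations
    of `prefix` by averaging a safety score over Monte Carlo rollouts."""
    rng = random.Random(seed)
    scores = [harm_score(sample_continuation(prefix, rng))
              for _ in range(n_rollouts)]
    return sum(scores) / n_rollouts

# Each (prefix, rollout_target(prefix)) pair becomes a training example
# for the streaming guardrail -- no token-level "earliest unsafe
# boundary" annotation is ever needed.
```

The key property is that the target is defined for every prefix, not just at an annotated boundary, which is what allows the guardrail to forecast risk before the unsafe content has actually appeared.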

Abstract

In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
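At deployment time, a forecasting guardrail of this kind would score each growing prefix as tokens stream out and halt generation once predicted risk crosses a threshold. The sketch below is an assumption about how such a loop could look, not the paper's system: `forecast_risk` stands in for the trained prefix-level forecaster, and the threshold value is illustrative.

```python
def forecast_risk(prefix):
    # Stand-in for a trained prefix-level risk forecaster
    # returning predicted harmfulness in [0, 1] (hypothetical).
    return 0.9 if "attack" in prefix else 0.1

def stream_with_guardrail(tokens, threshold=0.5):
    """Emit tokens one at a time, intervening as soon as the forecast
    harmfulness of the current prefix exceeds `threshold`."""
    prefix = ""
    emitted = []
    for tok in tokens:
        prefix += tok
        if forecast_risk(prefix) > threshold:
            # Intervene before `tok` reaches the user.
            return emitted, "blocked"
        emitted.append(tok)
    return emitted, "completed"
```

Because the forecaster predicts the risk of future continuations rather than detecting harm already present, the loop can block before the unsafe span is actually emitted, which is what the paper's on-time intervention and miss-rate metrics measure.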