ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

arXiv cs.CL / 3/17/2026

💬 OpinionModels & Research

共有:

Key Points

ContiGuard proposes a framework for continual toxicity detection to contend with evolving evasive perturbations that bypass existing detectors.
The approach addresses the limitation of static toxicity detectors by enabling ongoing adaptation as new evasion tactics emerge.
The framework aims to maintain robust detection performance over time, integrating new data while mitigating forgetting of prior knowledge.
The work targets safer online environments by improving the resilience of toxicity detection systems against adversarial content.

Abstract

Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages within online social actions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors or methods are static over time and are inadequate in addressing these evolving evasion tactics. Thus, continual learning emerges as a logical approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector's continual learning on perturbed text. More importantly, perturbation-induced noises distort semantics to degrade comprehension and also impair critical feature learning to render detection sensitive to perturbations. These amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection) to enable the detector to continually update capability and maintain sustained resilience against evolving perturbations. Specifically, to boost the comprehension, we present an LLM-powered semantic enriching strategy, where we dynamically incorporate possible meaning and toxicity-related clues excavated by LLM into the perturbed text to improve the comprehension. To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, where we strengthen discriminative features while suppressing the less-discriminative ones to shape a robust classification boundary for detection...

Continue reading this article on the original site.

Read original →

[D] Matryoshka Representation Learning

Reddit r/MachineLearning

Two new Qwen3.5 “Neo” fine‑tunes focused on fast, efficient reasoning

Reddit r/LocalLLaMA

HKIC, Gobi Partners and HKU team up for fund backing university research start-ups

SCMP Tech

Yann LeCun’s New LeWorldModel (LeWM) Research Targets JEPA Collapse in Pixel-Based Predictive World Modeling

MarkTechPost

Streaming experts

Simon Willison's Blog

ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

Key Points

Abstract

Related Articles

[D] Matryoshka Representation Learning

Two new Qwen3.5 “Neo” fine‑tunes focused on fast, efficient reasoning

HKIC, Gobi Partners and HKU team up for fund backing university research start-ups

Yann LeCun’s New LeWorldModel (LeWM) Research Targets JEPA Collapse in Pixel-Based Predictive World Modeling

Streaming experts

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer