Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

arXiv cs.AI / 4/1/2026


Key Points

  • The paper presents Xuanwu VL-2B as a case study in turning general multimodal LLMs into an industrial-grade foundation model for content-ecosystem tasks such as moderation under adversarial conditions.
  • It uses a compact ~2B-parameter architecture (InternViT-300M + MLP + Qwen3 1.7B) designed to balance fine-grained visual perception, language-semantic alignment, and deployment cost.
  • The authors introduce a data iteration and curation mechanism plus a progressive three-stage training pipeline (pre-training, mid-training, post-training) to maintain general capabilities while enabling business specialization.
  • Offline evaluations report an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B) and strong moderation recall (94.38% averaged over seven independent business tasks), including a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, vs. 76.72% for Gemini-2.5-Pro.
  • The study argues that, even under a limited parameter budget, Xuanwu VL-2B achieves a practical tradeoff among business alignment, robustness to long-tail noise, retention of general capabilities, and deployment cost.
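The architecture described above follows the common pattern of bridging a vision encoder and an LLM with a small MLP projector. A minimal sketch of that composition, with toy dimensions rather than the real InternViT-300M/Qwen3 1.7B sizes (and a ReLU stand-in for the actual activation, which the summary does not specify):

```python
# Sketch of the vision-encoder -> MLP projector -> LLM token pipeline.
# All dimensions below are toy placeholders, not the model's real sizes.

def mlp_projector(patch_embeddings, w1, w2):
    """Two-layer MLP projector: vision dim -> hidden -> LLM embedding dim."""
    def matvec(w, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    def act(v):  # simple ReLU stand-in for the real activation
        return [max(0.0, x) for x in v]
    # Each projected patch becomes one "visual token" in the LLM's space.
    return [matvec(w2, act(matvec(w1, p))) for p in patch_embeddings]

# Toy setup: 3 patches, vision dim 4, hidden dim 5, LLM dim 6.
vision_dim, hidden_dim, llm_dim = 4, 5, 6
w1 = [[0.1] * vision_dim for _ in range(hidden_dim)]
w2 = [[0.1] * hidden_dim for _ in range(llm_dim)]
patches = [[1.0] * vision_dim for _ in range(3)]

visual_tokens = mlp_projector(patches, w1, w2)
print(len(visual_tokens), len(visual_tokens[0]))  # prints "3 6"
```

The projected visual tokens are then interleaved with text embeddings and fed to the language model; keeping the projector small is what lets the whole stack fit in a ~2B-parameter budget.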

Abstract

In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
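The "weighted overall recall" reported above is typically computed by weighting each violation category's recall by its share of positive examples, which is equivalent to pooled (micro) recall. A small illustration with hypothetical categories and counts (not figures from the paper):

```python
# Weighted overall recall: per-category recall averaged with weights
# proportional to each category's positive count. Counts are hypothetical.

def weighted_recall(per_category):
    """per_category: list of (true_positives, positives) per violation type."""
    total_pos = sum(p for _, p in per_category)
    return sum((p / total_pos) * (tp / p) for tp, p in per_category)

# Three hypothetical categories; weighting by positives makes this
# identical to pooled recall: sum(tp) / sum(positives) = 270 / 350.
cats = [(90, 100), (40, 50), (140, 200)]
print(round(weighted_recall(cats), 4))  # prints "0.7714"
```

With count-proportional weights the per-category terms telescope to `sum(tp) / sum(positives)`, so large categories dominate the score; other weighting schemes (e.g. uniform over categories) would give different numbers.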