Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

arXiv cs.AI / 4/1/2026


Key Points

  • The paper presents Xuanwu VL-2B as a case study in turning general multimodal LLMs into an industrial-grade foundation model for content-ecosystem tasks such as moderation under adversarial conditions.
  • It uses a compact ~2B-parameter architecture (InternViT-300M + MLP + Qwen3 1.7B) designed to balance fine-grained visual perception, language-semantic alignment, and deployment cost.
  • The authors introduce a data iteration and curation mechanism plus a progressive three-stage training pipeline (pre-training, mid-training, post-training) to maintain general capabilities while enabling business specialization.
  • Offline evaluations report an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B) and strong moderation recall (94.38% averaged over seven independent business tasks), including a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, vs. 76.72% for Gemini-2.5-Pro.
  • The study argues that, even under a limited parameter budget, Xuanwu VL-2B achieves a practical tradeoff among business alignment, robustness to long-tail noise, retention of general capabilities, and deployment cost.
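The architecture described above follows the common pattern of bridging a vision encoder and an LLM with a small MLP projector. A minimal sketch of that composition, with toy dimensions rather than the real InternViT-300M/Qwen3 1.7B sizes (and a ReLU stand-in for the actual activation, which the summary does not specify):

```python
# Sketch of the vision-encoder -> MLP projector -> LLM token pipeline.
# All dimensions below are toy placeholders, not the model's real sizes.

def mlp_projector(patch_embeddings, w1, w2):
    """Two-layer MLP projector: vision dim -> hidden -> LLM embedding dim."""
    def matvec(w, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    def act(v):  # simple ReLU stand-in for the real activation
        return [max(0.0, x) for x in v]
    # Each projected patch becomes one "visual token" in the LLM's space.
    return [matvec(w2, act(matvec(w1, p))) for p in patch_embeddings]

# Toy setup: 3 patches, vision dim 4, hidden dim 5, LLM dim 6.
vision_dim, hidden_dim, llm_dim = 4, 5, 6
w1 = [[0.1] * vision_dim for _ in range(hidden_dim)]
w2 = [[0.1] * hidden_dim for _ in range(llm_dim)]
patches = [[1.0] * vision_dim for _ in range(3)]

visual_tokens = mlp_projector(patches, w1, w2)
print(len(visual_tokens), len(visual_tokens[0]))  # prints "3 6"
```

The projected visual tokens are then interleaved with text embeddings and fed to the language model; keeping the projector small is what lets the whole stack fit in a ~2B-parameter budget.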

Abstract

In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
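The "weighted overall recall" reported above is typically computed by weighting each violation category's recall by its share of positive examples, which is equivalent to pooled (micro) recall. A small illustration with hypothetical categories and counts (not figures from the paper):

```python
# Weighted overall recall: per-category recall averaged with weights
# proportional to each category's positive count. Counts are hypothetical.

def weighted_recall(per_category):
    """per_category: list of (true_positives, positives) per violation type."""
    total_pos = sum(p for _, p in per_category)
    return sum((p / total_pos) * (tp / p) for tp, p in per_category)

# Three hypothetical categories; weighting by positives makes this
# identical to pooled recall: sum(tp) / sum(positives) = 270 / 350.
cats = [(90, 100), (40, 50), (140, 200)]
print(round(weighted_recall(cats), 4))  # prints "0.7714"
```

With count-proportional weights the per-category terms telescope to `sum(tp) / sum(positives)`, so large categories dominate the score; other weighting schemes (e.g. uniform over categories) would give different numbers.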