AI Navigate

Fanar 2.0: Arabic Generative AI Stack

arXiv cs.CL / 3/18/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • Fanar 2.0 is the second generation of Qatar's Arabic-centric Generative AI platform, designed and operated entirely in-house at QCRI with sovereignty as a core principle.
  • It runs on 256 NVIDIA H100 GPUs and pairs a data-quality-first strategy with targeted continual pre-training and model merging, achieving substantial gains with 8x fewer pre-training tokens than Fanar 1.0.
  • The core Fanar-27B model is continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens spanning three data recipes, delivering benchmark gains of +9.1 points in Arabic knowledge, +7.3 in Arabic language, +3.5 in dialects, and +7.6 in English capability.
  • The Fanar 2.0 stack adds new capabilities: FanarGuard moderation, Aura long-form ASR, Oryx Arabic-aware image/video understanding and generation, an agentic tool-calling framework for multi-step workflows, Fanar-Sadiq for Islamic content, Fanar-Diwan for classical Arabic poetry generation, FanarShaheen bilingual translation, and a redesigned multi-layer orchestrator for intent-aware routing and safety validation. Together these show that sovereign, resource-constrained AI can rival systems built at far larger scale.
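The multi-layer orchestrator mentioned above can be pictured with a minimal routing sketch. This is a hypothetical illustration only, not the paper's implementation: the intent labels, the `detect_intent`, `is_safe`, and `route` functions, and the keyword rules are all invented for clarity.

```python
# Hypothetical sketch of intent-aware routing with a safety gate on both
# input and output, loosely modeled on the "defense-in-depth" orchestrator
# described in the summary. All names here are illustrative inventions.

def detect_intent(query: str) -> str:
    """Toy intent classifier: keyword rules stand in for a learned router."""
    q = query.lower()
    if "translate" in q:
        return "translation"
    if "poem" in q or "poetry" in q:
        return "poetry"
    return "general"

# Map each intent to a stand-in for the specialized component that serves it.
HANDLERS = {
    "translation": lambda q: f"[FanarShaheen] {q}",
    "poetry": lambda q: f"[Fanar-Diwan] {q}",
    "general": lambda q: f"[Fanar-27B] {q}",
}

BLOCKLIST = {"unsafe"}  # placeholder for a real moderation model

def is_safe(text: str) -> bool:
    """Stand-in for a FanarGuard-style moderation check."""
    return not any(term in text.lower() for term in BLOCKLIST)

def route(query: str) -> str:
    # Defense in depth: validate the input, route by intent, then
    # validate the generated output before returning it.
    if not is_safe(query):
        return "[blocked by input filter]"
    response = HANDLERS[detect_intent(query)](query)
    if not is_safe(response):
        return "[blocked by output filter]"
    return response
```

In a production orchestrator both `detect_intent` and `is_safe` would be learned models (the summary describes FanarGuard as a 4B bilingual moderation filter), but the control flow, classify, dispatch, and gate, is the part this sketch is meant to convey.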

Abstract

We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic accounting for only ~0.5% of web data despite its 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. The Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
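The abstract cites model merging as one of the levers behind Fanar-27B's gains. As background, a common form of merging is a weighted average of checkpoint parameters in weight space; the sketch below shows that generic idea on toy weight dictionaries. It is not the paper's specific recipe, and the checkpoint names and `merge_checkpoints` helper are invented for illustration.

```python
# Illustrative weight-space model merging: a weighted average of two
# checkpoints, represented here as plain dicts of parameter lists.
# This demonstrates the generic linear-merge technique only; Fanar 2.0's
# actual merging recipe is not specified in this summary.

def merge_checkpoints(a: dict, b: dict, alpha: float = 0.5) -> dict:
    """Linear interpolation: merged = alpha * a + (1 - alpha) * b."""
    assert a.keys() == b.keys(), "checkpoints must share parameter names"
    return {
        name: [alpha * wa + (1 - alpha) * wb
               for wa, wb in zip(a[name], b[name])]
        for name in a
    }

# Two toy "checkpoints" sharing the same parameter names and shapes,
# e.g. a base model and a domain-tuned variant of it.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
tuned = {"layer.weight": [3.0, 4.0], "layer.bias": [2.0]}

merged = merge_checkpoints(base, tuned, alpha=0.5)
# merged["layer.weight"] == [2.0, 3.0]; merged["layer.bias"] == [1.0]
```

Real merging toolchains operate on tensor state dicts and offer fancier schemes than plain interpolation (e.g. spherical or task-vector merges), but the core operation, combining same-shaped parameters from multiple fine-tuned checkpoints, is what the toy version above captures.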