Fanar 2.0: Arabic Generative AI Stack
arXiv cs.CL · March 18, 2026
Tags: News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- Fanar 2.0 is the second generation of Qatar's Arabic-centric Generative AI platform, designed and operated entirely in-house at QCRI with sovereignty as a core principle.
- It runs on 256 NVIDIA H100 GPUs and uses a data-quality-first strategy with targeted continual pre-training and model merging to achieve gains while using 8x fewer pre-training tokens than Fanar 1.0.
- The core Fanar-27B model is continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes, delivering benchmark gains of 9.1 points in Arabic knowledge, 7.3 in Arabic language, 3.5 in dialects, and 7.6 in English capability.
- The Fanar 2.0 stack adds capabilities including FanarGuard content moderation, Aura long-form ASR, Oryx Arabic-aware image/video understanding and generation, an agentic tool-calling framework for multi-step workflows, Fanar-Sadiq for Islamic content, Fanar-Diwan for classical Arabic poetry generation, Fanar-Shaheen bilingual translation, and a redesigned multi-layer orchestrator for intent-aware routing and safety validation. Together, these show that a sovereign, resource-constrained AI stack can rival larger-scale systems.
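The article does not detail Fanar's merging recipe, but model merging in general often works by combining "task vectors" (the weight deltas between fine-tuned checkpoints and a shared base) back onto the base model. The sketch below is a minimal, illustrative NumPy version of that linear-interpolation idea, assuming a simple dict-of-arrays checkpoint format; it is not Fanar's actual implementation.

```python
import numpy as np

def merge_models(base, finetuned_list, alphas):
    """Merge fine-tuned checkpoints by averaging their task vectors
    (finetuned - base), scaled by per-model alphas, onto the base weights.
    `base` and each entry of `finetuned_list` map parameter names to arrays."""
    merged = {}
    for name, w_base in base.items():
        # Sum of alpha-weighted deltas across all fine-tuned checkpoints.
        delta = sum(a * (ft[name] - w_base)
                    for ft, a in zip(finetuned_list, alphas))
        merged[name] = w_base + delta
    return merged

# Toy example with a single 3-parameter "layer" (hypothetical names/values).
base = {"layer.w": np.zeros(3)}
ft_a = {"layer.w": np.array([1.0, 0.0, 0.0])}  # e.g. tuned for knowledge
ft_b = {"layer.w": np.array([0.0, 2.0, 0.0])}  # e.g. tuned for dialects
merged = merge_models(base, [ft_a, ft_b], alphas=[0.5, 0.5])
print(merged["layer.w"])  # → [0.5 1.  0. ]
```

Equal alphas reduce to plain averaging; in practice the coefficients are tuned per target capability, which is what makes merging a cheap complement to continual pre-training.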
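To make the orchestrator's role concrete, here is a minimal sketch of intent-aware routing: a dispatcher that inspects a request and selects a backend. The backend names are taken from the components listed above, but the routing rules, the `route` function, and the request schema are all hypothetical; a production orchestrator would use learned intent classification plus a safety-validation layer rather than keyword checks.

```python
def route(request: dict) -> str:
    """Pick a backend for a request (hypothetical rules for illustration).
    `request` may carry a 'modality' key and/or a 'text' key."""
    modality = request.get("modality")
    if modality == "audio":
        return "Aura-ASR"            # long-form speech recognition
    if modality in ("image", "video"):
        return "Oryx"                # Arabic-aware visual understanding
    text = request.get("text", "").lower()
    if "translate" in text:
        return "Fanar-Shaheen"       # bilingual translation
    if any(k in text for k in ("quran", "hadith", "fatwa")):
        return "Fanar-Sadiq"         # Islamic content, stricter validation
    return "Fanar-27B"               # default general-purpose LLM

print(route({"modality": "audio"}))                 # → Aura-ASR
print(route({"text": "Translate this to Arabic"}))  # → Fanar-Shaheen
```

The point of a multi-layer design is that routing, generation, and safety validation (e.g. a FanarGuard-style moderation pass on both input and output) are separable stages, so each can be upgraded independently.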