SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

arXiv cs.CL / 4/22/2026

Key Points

The article introduces SAHM, a new Arabic financial NLP benchmark and instruction-tuning dataset focused on document-grounded and Shari'ah-compliant reasoning.
SAHM includes 14,380 expert-verified examples across seven tasks: AAOIFI standards QA, fatwa-based QA and MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning.
The authors evaluate 19 open and proprietary LLMs with task-specific metrics and rubric-based scoring for open-ended responses.
Results show that strong Arabic language ability does not reliably translate into evidence-grounded financial reasoning, with the biggest performance gaps on event-cause reasoning.
The benchmark, evaluation framework, and an instruction-tuned model are released to enable further research into trustworthy Arabic financial NLP.

Abstract

English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari'ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
