ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

arXiv cs.CL · April 23, 2026


Key Points

  • ORPHEAS is a specialized Greek-English embedding model designed to improve cross-lingual retrieval-augmented generation in bilingual Greek–English scenarios.
  • The model addresses limitations of existing multilingual embeddings by concentrating capacity on Greek-specific morphology and domain terminology rather than spreading representational power across many languages.
  • ORPHEAS is trained using a high-quality dataset created via a knowledge-graph-based fine-tuning approach over a diverse multi-domain corpus.
  • Experiments on monolingual and cross-lingual retrieval benchmarks show that ORPHEAS outperforms current state-of-the-art multilingual embedding models while preserving cross-lingual retrieval performance.
  • The results suggest that domain-specialized fine-tuning for morphologically complex languages can yield better bilingual semantic alignment for RAG systems.

Abstract

Effective retrieval-augmented generation in bilingual Greek–English applications requires embedding models capable of capturing both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek–English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained on a high-quality dataset generated by a knowledge-graph-based fine-tuning methodology applied to a diverse multi-domain corpus, enabling language-agnostic semantic representations. Numerical experiments across monolingual and cross-lingual retrieval benchmarks show that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning on a morphologically complex language does not compromise cross-lingual retrieval capability.
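To make the retrieval step concrete, the sketch below shows how a cross-lingual embedding model like ORPHEAS would typically be used in a RAG pipeline: embed a query in one language, embed candidate passages in either language, and rank by cosine similarity. The paper does not describe a public API, so the `embed` function here is a hypothetical stand-in backed by fixed toy vectors rather than a real model.

```python
import numpy as np

# Hypothetical stand-in for a bilingual embedding model; real systems would
# call the model's encoder here. Toy vectors are chosen so that the English
# query and its Greek answer passage point in nearly the same direction.
TOY_EMBEDDINGS = {
    "What is the capital of Greece?": np.array([0.90, 0.10, 0.00]),
    # "Athens is the capital of Greece."
    "Η Αθήνα είναι η πρωτεύουσα της Ελλάδας.": np.array([0.85, 0.15, 0.05]),
    "Retrieval-augmented generation combines search with LLMs.": np.array([0.10, 0.20, 0.95]),
}

def embed(text: str) -> np.ndarray:
    """Return a unit-normalized embedding (stubbed lookup in this sketch)."""
    v = TOY_EMBEDDINGS[text]
    return v / np.linalg.norm(v)

def retrieve(query: str, corpus: list[str], top_k: int = 1) -> list[tuple[str, float]]:
    """Rank corpus passages by cosine similarity to the query embedding."""
    q = embed(query)
    scored = [(doc, float(embed(doc) @ q)) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

corpus = [
    "Η Αθήνα είναι η πρωτεύουσα της Ελλάδας.",
    "Retrieval-augmented generation combines search with LLMs.",
]
print(retrieve("What is the capital of Greece?", corpus))
```

Because both languages share one embedding space, the English query retrieves the Greek passage directly, with no translation step; this is the cross-lingual alignment property the benchmarks in the paper evaluate.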