Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation

arXiv cs.CL / 4/24/2026


Key Points

  • The paper introduces multilingual Tip-of-the-Tongue (ToT) retrieval test collections for Chinese, Japanese, Korean, and English to address the field’s previous reliance on English-only benchmarks.
  • It uses an LLM-based query simulation framework to generate synthetic ToT queries and evaluates how prompt language and source-document language influence the fidelity of the simulations.
  • The authors validate simulated queries by measuring system rank correlation against real user queries, showing that language-aware design is crucial for effective ToT simulation.
  • Key findings indicate that source documents in the target (non-English) language are generally important for query fidelity, while English Wikipedia can compensate when non-English sources lack enough information for query generation.
  • The work releases four large-scale ToT benchmarks (5,000 queries per language across multiple domains) and provides guidance for building realistic ToT datasets for languages beyond English.

Abstract

Tip-of-the-Tongue (ToT) retrieval benchmarks have largely focused on English, limiting their applicability to multilingual information access. In this work, we construct multilingual ToT test collections for Chinese, Japanese, Korean, and English, using an LLM-based query simulation framework. We systematically study how prompt language and source document language affect the fidelity of simulated ToT queries, validating synthetic queries through system rank correlation against real user queries. Our results show that effective ToT simulation requires language-aware design choices: non-English language sources are generally important, while English Wikipedia can be beneficial when non-English sources provide insufficient information for query generation. Based on these findings, we release four ToT test collections with 5,000 queries per language across multiple domains. This work provides the first large-scale multilingual ToT benchmark and offers practical guidance for constructing realistic ToT datasets beyond English.
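The validation step described above, checking whether simulated queries rank retrieval systems the same way real user queries do, can be sketched with Kendall's tau rank correlation. A minimal pure-Python illustration, where the system scores are hypothetical and not from the paper:

```python
# Illustrative sketch (not the paper's code): validate simulated ToT queries
# by correlating the system ranking they induce with the ranking induced by
# real user queries. All scores below are hypothetical.

def kendall_tau(xs, ys):
    """Kendall's tau-a rank correlation between two score lists (no ties)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant if both score lists order systems i, j
            # the same way, discordant if they disagree.
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical effectiveness scores (e.g., nDCG) for five retrieval systems,
# measured once on real ToT queries and once on simulated ones.
real_scores = [0.42, 0.35, 0.51, 0.28, 0.47]
simulated_scores = [0.40, 0.33, 0.49, 0.31, 0.29]

tau = kendall_tau(real_scores, simulated_scores)
print(f"Kendall's tau between system rankings: {tau:.2f}")
```

A high tau means the synthetic benchmark would pick the same winners as a benchmark built from real queries, which is the fidelity criterion the paper uses to compare prompt-language and source-language design choices.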