Smart Bilingual Focused Crawling of Parallel Documents

arXiv cs.CL / 3/25/2026


Key Points

  • The paper addresses the inefficiency of brute-force crawling for parallel (mutually translated) documents by proposing a “smart” crawl strategy that targets parallel content earlier.
  • It uses a neural approach based on a pre-trained multilingual Transformer encoder, fine-tuned for two URL-based tasks: predicting a document’s language from its URL alone, and predicting whether a pair of URLs points to parallel documents.
  • The authors evaluate both models separately and then as an integrated crawling tool, showing that each component is effective on its own.
  • Combining the language-from-URL and URL-pair parallelism models improves early discovery of parallel content for a specific language pair during crawling, reducing useless downloads and increasing the number of parallel documents found versus conventional methods.
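
As a rough illustration of how the two URL-based predictions could steer a crawl, the sketch below keeps the download frontier in a priority queue. Everything in it is an assumption made for illustration rather than the authors' implementation: the names score_url, Frontier and rank, the toy string heuristics standing in for the two fine-tuned models, and the rule that multiplies the two probabilities.

```python
import heapq
from typing import Callable

# Hypothetical scorer signatures; neither is an API from the paper.
LangProb = Callable[[str, str], float]  # (url, language code) -> probability
PairProb = Callable[[str, str], float]  # (url_a, url_b) -> probability of being parallel


def score_url(url: str, known_urls: list[str], lang_prob: LangProb,
              pair_prob: PairProb, target_lang: str) -> float:
    """Assumed combination rule: P(target language) * best pair probability."""
    p_lang = lang_prob(url, target_lang)
    p_pair = max((pair_prob(known, url) for known in known_urls), default=0.0)
    return p_lang * p_pair


class Frontier:
    """Max-priority queue of URLs that are still to be downloaded."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seen: set[str] = set()
        self._count = 0  # tie-breaker: earlier insertions come out first

    def push(self, url: str, score: float) -> None:
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


# Toy stand-ins for the two models: plain string heuristics, not neural networks.
def toy_lang_prob(url: str, lang: str) -> float:
    return 0.9 if f"/{lang}/" in url else 0.1


def toy_pair_prob(url_a: str, url_b: str) -> float:
    seg_a, seg_b = url_a.split("/"), url_b.split("/")
    if len(seg_a) != len(seg_b):
        return 0.0
    diffs = [(a, b) for a, b in zip(seg_a, seg_b) if a != b]
    # Likely parallel if the URLs differ in exactly one two-letter segment.
    looks_parallel = len(diffs) == 1 and len(diffs[0][0]) == 2 and len(diffs[0][1]) == 2
    return 0.8 if looks_parallel else 0.05


def rank(candidates: list[str], known_urls: list[str], target_lang: str) -> list[str]:
    """Return the candidates in the order the frontier would download them."""
    frontier = Frontier()
    for url in candidates:
        frontier.push(url, score_url(url, known_urls, toy_lang_prob,
                                     toy_pair_prob, target_lang))
    return [frontier.pop() for _ in range(len(frontier))]
```

With a known English page such as https://example.com/en/about and French as the target language, rank places https://example.com/fr/about ahead of candidates that merely look French or merely mirror the path. A real system would replace the two toy functions with calls to the fine-tuned models.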

Abstract

Crawling parallel texts (texts that are mutual translations) from the Internet is usually done following a brute-force approach: documents are downloaded en masse in an unguided process, and only a fraction of them end up leading to actual parallel content. In this work we propose a smart crawling method that guides the crawl towards finding parallel content more rapidly. We follow a neural approach that adapts a pre-trained multilingual language model, based on the encoder of the Transformer architecture, by fine-tuning it for two new tasks: inferring the language of a document from its Uniform Resource Locator (URL), and inferring whether a pair of URLs point to parallel documents. We evaluate both models in isolation and their integration into a crawling tool. The results demonstrate the individual effectiveness of both models, and highlight that their combination enables us to address a practical engineering challenge: the early discovery of parallel content during web crawling in a given language pair. This leads to a reduction in the number of downloaded documents deemed useless, and yields a greater quantity of parallel documents compared to conventional crawling approaches.
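
To make "early discovery" and "useless downloads" concrete, the sketch below computes two simple quantities from a crawl log. These are illustrative measures, not necessarily the metrics reported in the paper. The function names are hypothetical, and the sketch assumes that a download counts as useless when the document does not end up in any parallel pair.

```python
def parallel_yield_curve(download_order: list[str], parallel_urls: set[str]) -> list[int]:
    """Cumulative count of parallel documents found after each download."""
    found = 0
    curve = []
    for url in download_order:
        if url in parallel_urls:
            found += 1
        curve.append(found)
    return curve


def useless_fraction(download_order: list[str], parallel_urls: set[str]) -> float:
    """Share of downloads that did not contribute to any parallel pair."""
    if not download_order:
        return 0.0
    useless = sum(1 for url in download_order if url not in parallel_urls)
    return useless / len(download_order)


# Made-up crawl logs: "a" and "b" are parallel documents, "x" and "y" are not.
parallel = {"a", "b"}
guided = ["a", "b", "x", "y"]      # a guided crawl reaches parallel pages first
unguided = ["x", "a", "y", "b"]    # an unguided crawl reaches them later

guided_curve = parallel_yield_curve(guided, parallel)      # [1, 2, 2, 2]
unguided_curve = parallel_yield_curve(unguided, parallel)  # [0, 1, 1, 2]
```

Both crawls find the same two documents eventually, but under a budget of two downloads the guided order wastes none of them, whereas the unguided order wastes half.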