Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

arXiv cs.AI / 4/8/2026


Key Points

  • The paper studies when a black-box LLM can be trusted, focusing on detecting “untrustworthy boundaries” over topics rather than judging answers directly.
  • It introduces GMRL-BD, an algorithm that uses multiple reinforcement learning agents and a Wikipedia-derived knowledge graph to locate topics where an LLM is likely to produce biased responses under query constraints.
  • Experiments indicate the method can identify these untrustworthy topic regions using only a limited number of LLM queries, making it practical for black-box settings.
  • The authors also released a new dataset covering several popular LLMs (e.g., Llama2, Vicuna, Falcon, Qwen2, Gemma2, Yi-1.5) with labeled topic areas where each model tends to be biased.

Abstract

Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologically slanted, or incorrect responses, which limits their application when it is unclear on which topics their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with only black-box access to the LLM and under specific query constraints. Built on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm employs multiple reinforcement learning agents to efficiently identify topics (certain nodes in the KG) on which the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of the algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset covering popular LLMs, including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.
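To make the core idea concrete, here is a minimal toy sketch of the general setup the abstract describes: agents exploring a topic graph and spending a fixed query budget on a black-box bias check. Everything below is hypothetical — the graph, the `query_llm_bias` oracle, and the epsilon-greedy update are illustrative stand-ins, not the paper's actual GMRL-BD algorithm or reward design.

```python
import random

# Toy topic graph: node -> neighbors (stand-in for a Wikipedia-derived KG).
GRAPH = {
    "science": ["physics", "politics"],
    "physics": ["science"],
    "politics": ["science", "elections"],
    "elections": ["politics"],
}

# Simulated ground truth for the toy oracle below; unknown in a real setting.
BIASED = {"politics", "elections"}

def query_llm_bias(topic):
    """Stand-in for prompting the black-box LLM about `topic` and
    scoring its responses for bias; each call costs one query."""
    return 1.0 if topic in BIASED else 0.0

def detect_boundary(n_agents=2, budget=12, epsilon=0.4, seed=1):
    """Epsilon-greedy multi-agent walk over the graph: agents learn
    per-topic bias estimates and flag topics scoring above 0.5."""
    rng = random.Random(seed)
    value = {t: 0.0 for t in GRAPH}          # learned bias estimates
    flagged, queries = set(), 0
    agents = [rng.choice(sorted(GRAPH)) for _ in range(n_agents)]
    while queries < budget:
        for i, node in enumerate(agents):
            if queries >= budget:
                break
            reward = query_llm_bias(node)     # one LLM query consumed
            queries += 1
            value[node] = reward
            if reward > 0.5:
                flagged.add(node)
            neighbors = GRAPH[node]
            if rng.random() < epsilon:        # explore a random neighbor
                agents[i] = rng.choice(neighbors)
            else:                             # exploit: highest estimated bias
                agents[i] = max(neighbors, key=value.__getitem__)
    return flagged

flagged_topics = detect_boundary()
```

The returned set approximates the "untrustworthy boundary" found within the budget; the actual method presumably uses a far richer KG, learned policies, and response-level bias scoring rather than this binary oracle.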