Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

arXiv cs.AI / 4/8/2026


Key Points

  • The paper studies when a black-box LLM can be trusted, focusing on detecting “untrustworthy boundaries” over topics rather than judging answers directly.
  • It introduces GMRL-BD, an algorithm that uses multiple reinforcement learning agents and a Wikipedia-derived knowledge graph to locate topics where an LLM is likely to produce biased responses under query constraints.
  • Experiments indicate the method can identify these untrustworthy topic regions using only a limited number of LLM queries, making it practical for black-box settings.
  • The authors also released a new dataset covering several popular LLMs (e.g., Llama2, Vicuna, Falcon, Qwen2, Gemma2, Yi-1.5) with labeled topic areas where each model tends to be biased.

Abstract

Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologically slanted, or incorrect responses, which limits their application when it is unclear on which topics their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with only black-box access to the LLM and under specific query constraints. Built on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm employs multiple reinforcement learning agents to efficiently identify topics (certain nodes in the KG) on which the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of the algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset covering popular LLMs, including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.
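To make the core idea concrete, here is a minimal toy sketch of the general setup the abstract describes: agents exploring a topic graph and spending a fixed query budget on a black-box bias check. Everything below is hypothetical — the graph, the `query_llm_bias` oracle, and the epsilon-greedy update are illustrative stand-ins, not the paper's actual GMRL-BD algorithm or reward design.

```python
import random

# Toy topic graph: node -> neighbors (stand-in for a Wikipedia-derived KG).
GRAPH = {
    "science": ["physics", "politics"],
    "physics": ["science"],
    "politics": ["science", "elections"],
    "elections": ["politics"],
}

# Simulated ground truth for the toy oracle below; unknown in a real setting.
BIASED = {"politics", "elections"}

def query_llm_bias(topic):
    """Stand-in for prompting the black-box LLM about `topic` and
    scoring its responses for bias; each call costs one query."""
    return 1.0 if topic in BIASED else 0.0

def detect_boundary(n_agents=2, budget=12, epsilon=0.4, seed=1):
    """Epsilon-greedy multi-agent walk over the graph: agents learn
    per-topic bias estimates and flag topics scoring above 0.5."""
    rng = random.Random(seed)
    value = {t: 0.0 for t in GRAPH}          # learned bias estimates
    flagged, queries = set(), 0
    agents = [rng.choice(sorted(GRAPH)) for _ in range(n_agents)]
    while queries < budget:
        for i, node in enumerate(agents):
            if queries >= budget:
                break
            reward = query_llm_bias(node)     # one LLM query consumed
            queries += 1
            value[node] = reward
            if reward > 0.5:
                flagged.add(node)
            neighbors = GRAPH[node]
            if rng.random() < epsilon:        # explore a random neighbor
                agents[i] = rng.choice(neighbors)
            else:                             # exploit: highest estimated bias
                agents[i] = max(neighbors, key=value.__getitem__)
    return flagged

flagged_topics = detect_boundary()
```

The returned set approximates the "untrustworthy boundary" found within the budget; the actual method presumably uses a far richer KG, learned policies, and response-level bias scoring rather than this binary oracle.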