Screening Is Enough

arXiv cs.LG / 4/2/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that conventional softmax attention lacks an absolute notion of query–key relevance because attention only redistributes fixed mass across competing keys, making it hard to explicitly reject irrelevant keys.
  • It introduces Multiscreen, a language-model architecture using a “screening” mechanism that thresholds each key for explicit acceptance/rejection and aggregates only the remaining keys, removing global competition.
  • Reported experiments show Multiscreen matching Transformer validation loss with about 40% fewer parameters, while training stably at much higher learning rates.
  • The method is reported to preserve strong long-context perplexity performance, show minimal retrieval degradation even beyond training context length, and cut inference latency by up to 3.2× at 100K context length.

Abstract

A core limitation of standard softmax attention is that it does not define a notion of absolute query–key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query–key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2× at 100K context length.
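To make the contrast concrete, here is a minimal NumPy sketch of the two relevance notions the abstract describes. The paper does not give its exact formulation, so the threshold name `tau`, the sigmoid gating of accepted keys, and the normalization by the number of accepted keys are all illustrative assumptions; the only property taken from the text is that each key is accepted or rejected against an absolute threshold, with no global softmax competition.

```python
import numpy as np

def softmax_attention(q, K, V):
    # Standard softmax attention: a fixed unit mass is redistributed
    # across ALL keys, so each key's weight depends on every other key
    # and no key can be given exactly zero weight.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def screening_attention(q, K, V, tau=0.0):
    # Hypothetical "screening" sketch: each key is judged against an
    # absolute threshold tau, independently of the other keys.
    scores = K @ q / np.sqrt(q.shape[0])
    accept = scores > tau                # explicit per-key accept/reject
    if not accept.any():
        # Every key rejected: the query attends to nothing.
        return np.zeros(V.shape[1])
    # Gate each accepted key on its own score (assumed sigmoid form),
    # then average; one key's weight never depends on competing keys.
    gates = 1.0 / (1.0 + np.exp(-(scores[accept] - tau)))
    return gates @ V[accept] / accept.sum()
```

Note the behavioral difference: raising `tau` in `screening_attention` can drive the output to exactly zero (all keys rejected), whereas `softmax_attention` always assigns positive weight to every key.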