What Do Your Logits Know? (The Answer May Surprise You!)

Apple Machine Learning Journal / 4/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Recent research shows that probing neural network internals can expose information that isn’t visible from the model’s outputs, creating risks of unintentional or malicious information leakage.
  • Using vision-language models as a testbed, the paper presents a systematic comparison of how much information survives as the rich representation in the residual stream is compressed to different “representational levels.”
  • The analysis examines information passing through two natural bottlenecks: low-dimensional projections and attention-based pooling/aggregation mechanisms.
  • The results suggest that seemingly “compressed” representations (e.g., logits-related signals) can still preserve substantial recoverable information, potentially surprising model owners and evaluators.
  • The work highlights a need for stronger privacy and security assumptions when deploying models, since internal-feature leakage may occur even when generations look safe.

Recent work has shown that probing model internals can reveal a wealth of information not apparent from the model generations. This poses the risk of unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible. Using vision-language models as a testbed, we present the first systematic comparison of information retained at different “representational levels” as it is compressed from the rich information encoded in the residual stream through two natural bottlenecks: low-dimensional projections of the residual…
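To make the idea of probing concrete, here is a minimal, self-contained sketch (not from the paper) of the standard linear-probe setup it builds on: a logistic-regression probe is trained to recover a hidden attribute from synthetic stand-ins for residual-stream activations, both before and after a low-dimensional projection bottleneck. All names, dimensions, and the use of a random projection are illustrative assumptions, not details of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: n hidden states of
# dimension d_model, each encoding a binary attribute along one direction.
d_model, n = 64, 2000
attr = rng.integers(0, 2, n)                       # hidden binary attribute
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d_model)) + np.outer(attr * 2 - 1, direction)

# A low-dimensional projection bottleneck: keep only k dimensions
# (here a random projection, purely for illustration).
k = 8
P = rng.normal(size=(d_model, k)) / np.sqrt(d_model)
Z = X @ P                                          # compressed representation

def train_linear_probe(feats, labels, lr=0.5, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # predicted probabilities
        w -= lr * feats.T @ (p - labels) / len(labels)
        b -= lr * np.mean(p - labels)
    return w, b

def probe_accuracy(feats, labels):
    w, b = train_linear_probe(feats, labels)
    return np.mean(((feats @ w + b) > 0) == labels)

acc_full = probe_accuracy(X, attr)   # probe the full residual stream
acc_proj = probe_accuracy(Z, attr)   # probe the low-dim projection
print(f"probe accuracy, full stream: {acc_full:.2f}")
print(f"probe accuracy, k={k} projection: {acc_proj:.2f}")
```

In this toy setup the attribute remains recoverable well above chance even after the 64-to-8 projection, which is the kind of "surprising retention through a bottleneck" the paper measures systematically in real vision-language models.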

Continue reading this article on the original site.
