Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction
arXiv cs.AI / 4/16/2026
Key Points
- The paper introduces a “text-as-signal” pipeline that converts a text corpus into quantitative semantic indicators by combining full-document embeddings with logprob-based scoring from a configurable positional dictionary.
- In a case study, the authors apply the method to 11,922 Portuguese AI-related news articles using a six-dimension semantic dictionary to create a corpus “identity space” for both document-level and aggregated corpus-level characterization.
- The workflow projects signals onto a noise-reduced low-dimensional manifold for structural interpretation, enabling clearer semantic positioning and comparison across documents.
- It uses Qwen embeddings, UMAP, semantic indicators derived directly from the model's output space, and a three-stage anomaly-detection procedure to support practical tasks such as corpus inspection and monitoring.
- The identity layer is designed to be configurable, allowing the framework to be adapted to different analytical needs rather than relying on a single fixed schema.
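
To make the logprob-based scoring idea more concrete, here is a minimal sketch, not the authors' implementation. It assumes each dictionary dimension is defined by two pole tokens and that a language model has already returned a log-probability for each pole token per document. The dimension names, pole tokens, and the function names `pole_score`, `document_identity`, and `corpus_identity` are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean

# Hypothetical two-pole dictionary: dimension -> (positive pole token, negative pole token).
# The paper's actual six-dimension dictionary is not reproduced here.
DICTIONARY = {
    "risk": ("high", "low"),
    "optimism": ("yes", "no"),
}


def pole_score(logprobs, pos, neg):
    """Renormalise two token logprobs into a score in [0, 1].

    Equals p_pos / (p_pos + p_neg), written as a sigmoid of the logprob gap.
    """
    gap = logprobs[pos] - logprobs[neg]
    return 1.0 / (1.0 + math.exp(-gap))


def document_identity(doc_logprobs, dictionary=DICTIONARY):
    """Map one document's per-dimension logprobs to a vector of scores."""
    return {
        dim: pole_score(doc_logprobs[dim], pos, neg)
        for dim, (pos, neg) in dictionary.items()
    }


def corpus_identity(doc_vectors):
    """Aggregate document-level scores into a corpus-level mean per dimension."""
    dims = doc_vectors[0].keys()
    return {dim: mean(vec[dim] for vec in doc_vectors) for dim in dims}


# Made-up logprobs standing in for real model output.
doc_a = {
    "risk": {"high": math.log(0.6), "low": math.log(0.2)},
    "optimism": {"yes": math.log(0.1), "no": math.log(0.3)},
}
doc_b = {
    "risk": {"high": math.log(0.2), "low": math.log(0.2)},
    "optimism": {"yes": math.log(0.3), "no": math.log(0.1)},
}

vec_a = document_identity(doc_a)
vec_b = document_identity(doc_b)
corpus = corpus_identity([vec_a, vec_b])
```

Because the dictionary is a plain mapping, swapping in a different set of dimensions changes the resulting identity space without touching the scoring code, which is one way a configurable identity layer could be realised.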