PRISM: PRIor from corpus Statistics for topic Modeling

arXiv cs.CL / 4/1/2026


Key Points

  • PRISM is introduced as a corpus-intrinsic initialization method for LDA that computes Dirichlet parameters from word co-occurrence statistics, avoiding changes to LDA’s original generative process.
  • The approach is designed to work without external knowledge sources (such as pre-trained embeddings), improving applicability to emerging or underexplored domains.
  • Experiments on both text corpora and single-cell RNA-seq data indicate higher topic coherence and better interpretability compared with baselines.
  • PRISM’s performance can rival models that rely on external knowledge, making it attractive for resource-constrained topic modeling scenarios.
  • The authors provide public code via the associated GitHub repository for reproducibility and adoption.

Abstract

Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce **PRISM**, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
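The paper does not detail its exact estimator here, but the core idea — deriving a Dirichlet parameter over the vocabulary from corpus-intrinsic co-occurrence statistics — can be sketched. The snippet below is an illustrative approximation, not PRISM's actual formula: the `cooccurrence_prior` function, the co-occurrence statistic used, and the `smoothing` floor are all assumptions made for demonstration.

```python
# Hedged sketch: derive a vocabulary-level Dirichlet prior from
# document-level word co-occurrence counts. Illustrative only; the
# statistic and normalization are NOT the authors' exact method.
import numpy as np

def cooccurrence_prior(docs, vocab, smoothing=0.01):
    """Return a per-word Dirichlet parameter vector (length |vocab|).

    Words that co-occur with many other words across documents get
    larger prior mass. `smoothing` is a hypothetical floor that keeps
    every parameter strictly positive.
    """
    V = len(vocab)
    idx = {w: i for i, w in enumerate(vocab)}
    cooc = np.zeros((V, V))
    for doc in docs:
        present = sorted({idx[w] for w in doc if w in idx})
        for i in present:
            for j in present:
                if i != j:
                    cooc[i, j] += 1.0
    # One plausible statistic: total co-occurrence strength per word,
    # rescaled so the prior sums to V (matching LDA's usual scale,
    # where a symmetric prior would be all ones).
    strength = cooc.sum(axis=1) + smoothing
    return V * strength / strength.sum()

# Toy corpus mirroring the paper's two settings (text / scRNA-seq-like).
docs = [["gene", "cell", "expression"],
        ["cell", "expression", "cluster"],
        ["topic", "model", "prior"]]
vocab = ["gene", "cell", "expression", "cluster", "topic", "model", "prior"]
eta = cooccurrence_prior(docs, vocab)
```

A vector like `eta` could then be passed as the topic-word prior of an off-the-shelf LDA implementation (e.g., the `eta` argument of gensim's `LdaModel` or `topic_word_prior` in scikit-learn), which is what makes this an initialization scheme rather than a change to LDA's generative process.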