Neighbor Embedding for High-Dimensional Sparse Poisson Data

arXiv stat.ML / 4/21/2026

Key Points

The paper addresses dimensionality reduction for high-dimensional, sparse count data that are well modeled by Poisson distributions; such data often violate the assumptions of standard methods like PCA and t-SNE.
It introduces p-SNE (Poisson Stochastic Neighbor Embedding), a nonlinear neighbor-embedding approach tailored to Poisson count structure.
p-SNE defines pairwise dissimilarity using KL divergence between Poisson distributions and optimizes the embedding using Hellinger distance.
Experiments on synthetic and real datasets show p-SNE can recover meaningful structure, including weekday patterns in email communication, research area clusters in OpenReview papers, and temporal drift and stimulus gradients in neural spike recordings.
The results suggest that incorporating the underlying probabilistic model of sparse counts can improve embedding quality compared with techniques that assume continuous Euclidean geometry.
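
For reference, both quantities named in the key points have closed forms for Poisson distributions. The sketch below is a minimal illustration in plain Python, not the paper's implementation: the function names are hypothetical, each data point is assumed to be described by a vector of independent Poisson rates, and how p-SNE estimates those rates or handles zeros is not described in this summary.

```python
import math

def poisson_kl(rates_p, rates_q):
    """KL(Pois(p) || Pois(q)), summed over independent coordinates.

    Per coordinate: p * log(p / q) - p + q, with the convention 0 * log(0) = 0.
    """
    if len(rates_p) != len(rates_q):
        raise ValueError("rate vectors must have the same length")
    total = 0.0
    for p, q in zip(rates_p, rates_q):
        if p < 0 or q < 0:
            raise ValueError("Poisson rates must be non-negative")
        if q == 0.0:
            if p > 0.0:
                return math.inf  # q puts no mass where p does
            continue  # both rates are zero: no contribution
        total += q - p
        if p > 0.0:
            total += p * math.log(p / q)
    return total

def poisson_hellinger_sq(rates_p, rates_q):
    """Squared Hellinger distance between products of Poisson distributions.

    Closed form: 1 - exp(-0.5 * sum_i (sqrt(p_i) - sqrt(q_i))**2).
    """
    if len(rates_p) != len(rates_q):
        raise ValueError("rate vectors must have the same length")
    s = sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(rates_p, rates_q))
    return 1.0 - math.exp(-0.5 * s)

# Hypothetical rate vectors for two data points.
a = [2.0, 0.1, 0.0]
b = [1.0, 0.1, 0.5]
print(poisson_kl(a, b), poisson_kl(b, a))  # KL is asymmetric
print(poisson_hellinger_sq(a, b))          # symmetric, bounded in [0, 1]
```

The KL divergence is asymmetric and can be infinite when a rate is exactly zero, while the squared Hellinger distance is symmetric and bounded, which is one plausible reason to treat them differently in a neighbor-embedding objective.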

Abstract

Across many scientific fields, measurements often represent the number of times an event occurs. For example, a document can be represented by word occurrence counts, neural activity by spike counts per time window, or online communication by daily email counts. These measurements yield high-dimensional count data that often approximate a Poisson distribution, frequently with low rates that produce substantial sparsity and complicate downstream analysis. A useful approach is to embed the data into a low-dimensional space that preserves meaningful structure, commonly termed dimensionality reduction. Yet existing dimensionality reduction methods, including both linear (e.g., PCA) and nonlinear approaches (e.g., t-SNE), often assume continuous Euclidean geometry, thereby misaligning with the discrete, sparse nature of low-rate count data. Here, we propose p-SNE (Poisson Stochastic Neighbor Embedding), a nonlinear neighbor embedding method designed around the Poisson structure of count data, using KL divergence between Poisson distributions to measure pairwise dissimilarity and Hellinger distance to optimize the embedding. We test p-SNE on synthetic Poisson data and demonstrate its ability to recover meaningful structure in real-world count datasets, including weekday patterns in email communication, research area clusters in OpenReview papers, and temporal drift and stimulus gradients in neural spike recordings.
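
To make the sparsity claim concrete: a Poisson variable with rate λ equals zero with probability exp(-λ), so low rates yield mostly zeros. The following sketch is an illustration only, assuming independent entries that share one hypothetical rate of 0.1 and using a simple Knuth-style sampler (suitable for small rates only); it compares the simulated fraction of zeros with that value.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's multiplication method (small rates)."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def zero_fraction(n_samples, n_features, rate, seed=0):
    """Fraction of zero entries in a simulated n_samples x n_features count matrix."""
    rng = random.Random(seed)
    n_entries = n_samples * n_features
    zeros = sum(1 for _ in range(n_entries) if sample_poisson(rate, rng) == 0)
    return zeros / n_entries

rate = 0.1  # hypothetical low event rate, not taken from the paper
observed = zero_fraction(50, 200, rate)
expected = math.exp(-rate)  # P(X = 0) for X ~ Poisson(rate)
print(f"observed zero fraction: {observed:.3f}, expected: {expected:.3f}")
```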