Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

arXiv cs.CL / 4/21/2026


Key Points

  • The paper evaluates authorship-attribution methods using stylistic features to enable actor analysis for threat intelligence, starting with Japanese reviews from clear web sources.
  • Experiments compare TF-IDF+logistic regression, BERT embeddings+logistic regression, BERT fine-tuning, and metric learning with k-NN on Rakuten Ichiba review datasets.
  • BERT fine-tuning delivers the best overall performance but becomes unstable when scaling to several hundred authors, at which point TF-IDF+LR is more accurate, more stable, and computationally cheaper.
  • Top-k evaluation indicates candidate screening is useful, and error analysis finds misclassifications are driven mainly by boilerplate text, topic dependence, and short text length.
  • The study positions these results as a foundational step toward future application to dark web forums, implying potential next steps in scaling and robustness.
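The TF-IDF+LR baseline compared above can be sketched in a few lines of scikit-learn. This is a minimal illustration with synthetic toy reviews, not the paper's actual setup: character n-grams are used here as an assumption, since they sidestep Japanese word segmentation; the authors' exact features and hyperparameters are not specified in this summary.

```python
# Minimal sketch of a TF-IDF + logistic regression authorship baseline.
# Synthetic data; character 2-3-grams avoid needing a Japanese tokenizer
# (an assumption, not necessarily the paper's feature choice).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "とても良い商品でした。また購入したいです。",
    "配送が早くて助かりました。梱包も丁寧。",
    "良い商品。リピートします。",
    "梱包が丁寧で配送も早い。満足です。",
]
authors = ["A", "B", "A", "B"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(reviews, authors)
print(clf.predict(["また買いたい、良い商品でした。"])[0])
```

On real data the same pipeline scales to hundreds of authors at low computational cost, which is the regime where the paper reports it overtaking BERT fine-tuning.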

Abstract

This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and compared four methods: TF-IDF with logistic regression (TF-IDF+LR), BERT embeddings with logistic regression (BERT-Emb+LR), BERT fine-tuning (BERT-FT), and metric learning with k-nearest neighbors (Metric+kNN). Results showed that BERT-FT achieved the best performance; however, training became unstable as the number of authors scaled to several hundred, where TF-IDF+LR proved superior in terms of accuracy, stability, and computational cost. Furthermore, Top-k evaluation demonstrated the utility of candidate screening, and error analysis revealed that boilerplate text, topic dependency, and short text length were primary factors causing misclassification.
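The Top-k evaluation mentioned in the abstract scores a prediction as correct if the true author appears among the k most probable candidates, which is how candidate screening (rather than exact attribution) is measured. A minimal sketch with a toy probability matrix, not the paper's data:

```python
# Hedged sketch of Top-k accuracy for candidate screening: a sample
# counts as correct if its true author is among the k candidates with
# the highest predicted probability. Toy numbers only.
import numpy as np

def top_k_accuracy(probs, y_true, classes, k):
    # probs: (n_samples, n_classes) predicted probabilities
    hits = 0
    for row, true in zip(probs, y_true):
        top = [classes[i] for i in np.argsort(row)[::-1][:k]]
        hits += true in top
    return hits / len(y_true)

classes = ["A", "B", "C"]
probs = np.array([
    [0.5, 0.3, 0.2],  # true A: top-1 hit
    [0.3, 0.2, 0.5],  # true A: missed at k=1, recovered at k=2
    [0.1, 0.6, 0.3],  # true B: top-1 hit
])
y_true = ["A", "A", "B"]

print(top_k_accuracy(probs, y_true, classes, 1))  # 2/3
print(top_k_accuracy(probs, y_true, classes, 2))  # 1.0
```

The jump from k=1 to k=2 in this toy example mirrors the paper's point: even when exact attribution fails, a short shortlist can still contain the true author, which is useful for analyst-in-the-loop screening.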