APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation

arXiv cs.CL / 5/1/2026

Key Points

The paper introduces APPSI-139, a newly released high-quality English parallel corpus of privacy policies annotated by domain experts to improve legal clarity and readability for summarization/interpretation tasks.
APPSI-139 contains 139 English privacy policies along with 15,692 rewritten parallel examples and 36,351 fine-grained labels across 11 data-practice categories.
It also proposes TCSI-pp-V2, a hybrid summarization and interpretation framework that uses alternating training and multiple coordinated expert modules to trade off computational efficiency and accuracy.
Experiments indicate that a hybrid system trained on APPSI-139 with TCSI-pp-V2 outperforms large language models like GPT-4o and LLaMA-3-70B on readability and reliability.
The dataset and source code are published on GitHub, enabling further research and benchmarking in privacy-policy understanding.

Abstract

Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long, complex, and filled with technobabble and legalese, causing users to unknowingly accept terms that may even contradict the law. While summarizing and interpreting these privacy policies is crucial, high-quality English parallel corpora optimized for legal clarity and readability are lacking. To address this issue, we introduce APPSI-139, a high-quality English privacy policy corpus meticulously annotated by domain experts and specifically designed for summarization and interpretation tasks. The corpus includes 139 English privacy policies, 15,692 rewritten parallel corpora, and 36,351 fine-grained annotation labels across 11 data practice categories. We also propose TCSI-pp-V2, a hybrid privacy policy summarization and interpretation framework that employs an alternating training strategy and coordinates multiple expert modules to balance computational efficiency and accuracy. Experimental results show that the hybrid summarization system built on the APPSI-139 corpus and the TCSI-pp-V2 framework outperforms large language models, such as GPT-4o and LLaMA-3-70B, in terms of readability and reliability. The source code and dataset are available at https://github.com/EnlightenedAI/APPSI-139.
