APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation

arXiv cs.CL / 5/1/2026

Key Points

  • The paper introduces APPSI-139, a newly released high-quality English parallel corpus of privacy policies annotated by domain experts to improve legal clarity and readability for summarization/interpretation tasks.
  • APPSI-139 contains 139 English privacy policies along with 15,692 rewritten parallel examples and 36,351 fine-grained labels across 11 data-practice categories.
  • It also proposes TCSI-pp-V2, a hybrid summarization and interpretation framework that uses alternating training and multiple coordinated expert modules to trade off computational efficiency and accuracy.
  • Experiments indicate that a hybrid system trained on APPSI-139 with TCSI-pp-V2 outperforms large language models like GPT-4o and LLaMA-3-70B on readability and reliability.
  • The dataset and source code are published on GitHub, enabling further research and benchmarking in privacy-policy understanding.
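
To make the corpus description concrete, the sketch below shows one way a parallel example with fine-grained labels could be represented and tallied. It is only an illustration: the record layout, the field names (policy_id, original, rewritten, labels), the placeholder category names, and the sample sentences are all assumptions, not the schema or content of the released dataset.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParallelExample:
    """Hypothetical record: one policy passage paired with its rewrite."""
    policy_id: str   # which of the privacy policies the passage comes from
    original: str    # passage as written in the policy
    rewritten: str   # plain-language rewrite
    labels: List[str] = field(default_factory=list)  # data-practice categories


def label_counts(examples):
    """Count how often each data-practice label occurs across examples."""
    counts = Counter()
    for ex in examples:
        counts.update(ex.labels)
    return counts


# Made-up sample records with placeholder category names (assumptions).
examples = [
    ParallelExample(
        "policy-001",
        "We may disclose your information to affiliated entities.",
        "We may share your data with companies in our group.",
        ["category_a", "category_b"],
    ),
    ParallelExample(
        "policy-001",
        "Data is retained for as long as necessary.",
        "We keep your data only as long as we need it.",
        ["category_a"],
    ),
]

counts = label_counts(examples)
for label, n in sorted(counts.items()):
    print(label, n)
```

A flat record like this keeps each rewrite next to its source passage, so the same structure could feed both a summarization step (original to rewritten) and a classification step (original to labels).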

Abstract

Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long, complex, and filled with technobabble and legalese, causing users to unknowingly accept terms that may even contradict the law. While summarizing and interpreting these privacy policies is crucial, there is a lack of high-quality English parallel corpora optimized for legal clarity and readability. To address this issue, we introduce APPSI-139, a high-quality English privacy policy corpus meticulously annotated by domain experts and specifically designed for summarization and interpretation tasks. The corpus includes 139 English privacy policies, 15,692 rewritten parallel corpora, and 36,351 fine-grained annotation labels across 11 data practice categories. Concurrently, we propose TCSI-pp-V2, a hybrid privacy policy summarization and interpretation framework that employs an alternating training strategy and coordinates multiple expert modules to effectively balance computational efficiency and accuracy. Experimental results show that the hybrid summarization system built on the APPSI-139 corpus and the TCSI-pp-V2 framework outperforms large language models, such as GPT-4o and LLaMA-3-70B, in terms of readability and reliability. The source code and dataset are available at https://github.com/EnlightenedAI/APPSI-139.