Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports

arXiv cs.CL / 4/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies sentence-level readability in German ESG reports to help non-expert consumers access sustainability information more clearly.
  • It extends an existing sentence dataset for German ESG reports by adding crowdsourced readability annotations and finds that readability judgments are subjective.
  • The authors evaluate multiple readability scoring approaches against human rankings, using prediction error and correlation metrics.
  • Results indicate that LLM prompting can help distinguish easy versus hard sentences, but a small fine-tuned transformer achieves the lowest prediction error versus human readability.
  • Ensembling/averaging predictions from multiple models can slightly improve accuracy, though it increases inference latency.
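The evaluation setup described above can be sketched roughly as follows: model predictions are scored against human readability ratings via mean absolute error and Spearman rank correlation, and an ensemble simply averages per-sentence predictions. All scores, ratings, and model labels below are illustrative placeholders, not data or results from the paper.

```python
# Illustrative sketch of the evaluation: MAE and rank correlation against
# human readability ratings, plus a simple averaging ensemble.

def mean_absolute_error(preds, human):
    """Average absolute gap between predicted and human readability scores."""
    return sum(abs(p - h) for p, h in zip(preds, human)) / len(human)

def spearman_rho(preds, human):
    """Spearman rank correlation via the classic formula (assumes no ties)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rp, rh = ranks(preds), ranks(human)
    n = len(rp)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rh))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical human ratings (higher = easier) for five sentences, and
# predictions from two hypothetical scorers.
human   = [4.2, 3.1, 4.8, 2.0, 3.5]
model_a = [4.0, 3.3, 4.5, 2.4, 3.6]   # e.g. a small fine-tuned transformer
model_b = [3.8, 2.9, 4.9, 1.5, 3.9]   # e.g. an LLM-prompting baseline

# Ensemble: average the two models' per-sentence predictions.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

for name, preds in [("model_a", model_a), ("model_b", model_b),
                    ("ensemble", ensemble)]:
    print(f"{name}: MAE={mean_absolute_error(preds, human):.3f}, "
          f"rho={spearman_rho(preds, human):.3f}")
```

With these made-up numbers the ensemble happens to edge out either single model on MAE, mirroring the paper's finding that averaging can slightly improve accuracy; in exchange, every sentence must be scored by all models, which is the latency cost noted above.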

Abstract

With the ever-growing urgency of sustainability in the economy and society, and the massive stream of information that comes with it, consumers need reliable access to that information. To address this need, companies began publishing so-called Environmental, Social, and Governance (ESG) reports, both voluntarily and as required by law. To serve the public, these reports must be addressed not only to financial experts but also to non-expert audiences. But are they written clearly enough? In this work, we extend an existing sentence-level dataset of German ESG reports with crowdsourced readability annotations. We find that, in general, native speakers perceive sentences in ESG reports as easy to read, but also that readability is subjective. We apply various readability scoring methods and evaluate them with respect to their prediction error and correlation with human rankings. Our analysis shows that, while LLM prompting has potential for distinguishing clear from hard-to-read sentences, a small fine-tuned transformer predicts human readability with the lowest error. Averaging predictions of multiple models can slightly improve performance at the cost of slower inference.