Building a fully local PDF-to-audiobook workflow with Kokoro 82M, Qwen and llama.cpp

Reddit r/LocalLLaMA / 4/30/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The article describes a local-first desktop app that reads technical PDFs aloud and highlights the corresponding text in sync while audio plays.
  • It uses Tauri 2.0 for a Mac desktop experience and Kokoro 82M for text-to-speech running entirely on-device.
  • The proposed workflow includes loading/rendering the PDF, extracting readable text, chunking it for TTS, generating speech locally, and then playing audio while tracking the active text segment.
  • The author is considering two export modes: generating an optimized set of audiobook audio files (via llama.cpp with a Qwen 3.5 0.8B or 2B model) and transforming the content into a more conversational, podcast-like format.
  • Key engineering challenges include aligning generated speech with the original PDF text, correctly handling code snippets/tables, and reducing initial latency to keep the experience interactive.

Hey everyone,

I’ve been building a local-first desktop PDF reader that can read technical books aloud and keep the spoken text highlighted while reading.

The original motivation was pretty practical: I read a lot of programming and technical books, but many publishers either don’t offer audio versions or charge extra for AI-generated audio. I wanted to see how far I could get with a completely local setup instead.

The app is built with Tauri 2.0 and runs locally on my Mac. For TTS I’m using Kokoro 82M. On my M1 Mac, there is a short initial wait while things warm up, but after that the generation is fast enough for normal listening. The current sentence / text segment is highlighted in the reader while the audio plays, so it still feels like reading along rather than just listening to a detached audio file.

The current pipeline is roughly:

  1. Load and render the PDF in the desktop app
  2. Extract readable text from the current section
  3. Split the text into chunks suitable for TTS
  4. Generate speech locally with Kokoro 82M
  5. Play the audio while highlighting the corresponding source text
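
Step 3 of the pipeline can be sketched roughly like this — a minimal, hypothetical chunker that splits extracted text on sentence boundaries and packs sentences into TTS-sized chunks (the `max_chars` limit and the naive regex split are assumptions, not the app's actual code):

```python
import re

def chunk_for_tts(text, max_chars=300):
    """Split extracted PDF text into sentence-aligned chunks small
    enough for a single TTS call. max_chars is a tunable guess."""
    # Naive split on ., !, ? followed by whitespace; real PDF text
    # needs smarter handling of abbreviations and hard line breaks.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Keeping chunks on sentence boundaries matters later: it is what makes highlighting the "current sentence" during playback feasible.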

The two export modes I’m thinking about are:

  • A straight audiobook mode, where the PDF becomes a set of audio files, with the text first optimized for listening by a Qwen 3.5 0.8B or 2B model running via llama.cpp
  • A podcast-style mode, where the material is transformed into a more conversational format

The most interesting technical problems so far are:

  • Keeping the generated speech aligned with the original PDF text
  • Handling code snippets and tables in technical books
  • Making the first generation fast enough that the app still feels interactive
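
For the alignment problem, one simple approach is to map each TTS chunk back to character offsets in the original section text, so the UI knows which span to highlight while its audio plays. This is a hypothetical sketch, not the app's implementation; `align_chunks` and its fallback behavior are assumptions:

```python
def align_chunks(section_text, chunks):
    """Map each TTS chunk to (start, end) character offsets in the
    original section text, so the UI can highlight the active span.
    Assumes chunks appear in document order."""
    spans, cursor = [], 0
    for chunk in chunks:
        start = section_text.find(chunk, cursor)
        if start == -1:
            # Chunk text was normalized before TTS (e.g. a code snippet
            # or table was rewritten); fall back to an empty span at the
            # cursor so playback is not blocked.
            spans.append((cursor, cursor))
            continue
        end = start + len(chunk)
        spans.append((start, end))
        cursor = end
    return spans
```

The `-1` branch is exactly where code snippets and tables bite: once you rewrite them into something speakable, an exact substring match no longer works, and you need to carry source offsets through the normalization step instead.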

After the initial 15 sentences are generated and read aloud, I need to process the next 15 in the background so playback continues smoothly — or maybe take a completely different approach to how things get preprocessed.
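
The look-ahead scheduling can be sketched as a double-buffered loop: a worker thread synthesizes the next batch while the main loop plays the current one. `synthesize` and `play` here are placeholders for Kokoro and the audio backend — this illustrates the scheduling only, under those assumptions:

```python
import threading
import queue

def run_pipeline(sentences, synthesize, play, batch_size=15):
    """Double-buffered playback: a worker thread synthesizes the next
    batch of sentences while the main loop plays the current one."""
    audio_q = queue.Queue(maxsize=1)  # holds at most one ready batch

    def producer():
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            audio_q.put(synthesize(batch))  # blocks if playback lags
        audio_q.put(None)  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (clip := audio_q.get()) is not None:
        play(clip)
```

With `maxsize=1`, synthesis stays at most one batch ahead of playback, which bounds memory use and keeps the initial wait limited to the first batch.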

That’s where the project is at right now. I’m still mostly building it for my own reading workflow, but if the result becomes useful enough and the codebase is not too embarrassing, I may open source it later.

submitted by /u/purellmagents