Building a fully local PDF-to-audiobook workflow with Kokoro 82M, Qwen and llama.cpp

Reddit r/LocalLLaMA / 4/30/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The article describes a local-first desktop app that reads technical PDFs aloud and highlights the corresponding text in sync while audio plays.
  • It uses Tauri 2.0 for a Mac desktop experience and Kokoro 82M for text-to-speech running entirely on-device.
  • The proposed workflow includes loading/rendering the PDF, extracting readable text, chunking it for TTS, generating speech locally, and then playing audio while tracking the active text segment.
  • The author is considering two export modes: generating an optimized set of audiobook audio files (via llama.cpp with a Qwen 3.5 0.8B or 2B model) and transforming the content into a more conversational, podcast-like format.
  • Key engineering challenges include aligning generated speech with the original PDF text, correctly handling code snippets/tables, and reducing initial latency to keep the experience interactive.

Hey everyone,

I’ve been building a local-first desktop PDF reader that can read technical books aloud and keep the spoken text highlighted while reading.

The original motivation was pretty practical: I read a lot of programming and technical books, but many publishers either don’t offer audio versions or charge extra for AI-generated audio. I wanted to see how far I could get with a completely local setup instead.

The app is built with Tauri 2.0 and runs locally on my Mac. For TTS I’m using Kokoro 82M. On my M1 Mac, there is a short initial wait while things warm up, but after that the generation is fast enough for normal listening. The current sentence / text segment is highlighted in the reader while the audio plays, so it still feels like reading along rather than just listening to a detached audio file.

The current pipeline is roughly:

  1. Load and render the PDF in the desktop app
  2. Extract readable text from the current section
  3. Split the text into chunks suitable for TTS
  4. Generate speech locally with Kokoro 82M
  5. Play the audio while highlighting the corresponding source text
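
Step 3 of the pipeline can be sketched roughly like this — a minimal, hypothetical chunker that splits extracted text on sentence boundaries and packs sentences into TTS-sized chunks (the `max_chars` limit and the naive regex split are assumptions, not the app's actual code):

```python
import re

def chunk_for_tts(text, max_chars=300):
    """Split extracted PDF text into sentence-aligned chunks small
    enough for a single TTS call. max_chars is a tunable guess."""
    # Naive split on ., !, ? followed by whitespace; real PDF text
    # needs smarter handling of abbreviations and hard line breaks.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Keeping chunks on sentence boundaries matters later: it is what makes highlighting the "current sentence" during playback feasible.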

The two export modes I’m thinking about are:

  • A straight audiobook mode, where the PDF becomes a set of audio files, with the text first optimized for listening by a Qwen 3.5 0.8B or 2B model running via llama.cpp
  • A podcast-style mode, where the material is transformed into a more conversational format

The most interesting technical problems so far are:

  • Keeping the generated speech aligned with the original PDF text
  • Handling code snippets and tables in technical books
  • Making the first generation fast enough that the app still feels interactive
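
For the alignment problem, one simple approach is to map each TTS chunk back to character offsets in the original section text, so the UI knows which span to highlight while its audio plays. This is a hypothetical sketch, not the app's implementation; `align_chunks` and its fallback behavior are assumptions:

```python
def align_chunks(section_text, chunks):
    """Map each TTS chunk to (start, end) character offsets in the
    original section text, so the UI can highlight the active span.
    Assumes chunks appear in document order."""
    spans, cursor = [], 0
    for chunk in chunks:
        start = section_text.find(chunk, cursor)
        if start == -1:
            # Chunk text was normalized before TTS (e.g. a code snippet
            # or table was rewritten); fall back to an empty span at the
            # cursor so playback is not blocked.
            spans.append((cursor, cursor))
            continue
        end = start + len(chunk)
        spans.append((start, end))
        cursor = end
    return spans
```

The `-1` branch is exactly where code snippets and tables bite: once you rewrite them into something speakable, an exact substring match no longer works, and you need to carry source offsets through the normalization step instead.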

After the initial 15 sentences are generated and read aloud, I need to process the next 15 in the background so playback continues smoothly — or maybe take a completely different approach to how things get preprocessed.
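
The look-ahead scheduling can be sketched as a double-buffered loop: a worker thread synthesizes the next batch while the main loop plays the current one. `synthesize` and `play` here are placeholders for Kokoro and the audio backend — this illustrates the scheduling only, under those assumptions:

```python
import threading
import queue

def run_pipeline(sentences, synthesize, play, batch_size=15):
    """Double-buffered playback: a worker thread synthesizes the next
    batch of sentences while the main loop plays the current one."""
    audio_q = queue.Queue(maxsize=1)  # holds at most one ready batch

    def producer():
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            audio_q.put(synthesize(batch))  # blocks if playback lags
        audio_q.put(None)  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (clip := audio_q.get()) is not None:
        play(clip)
```

With `maxsize=1`, synthesis stays at most one batch ahead of playback, which bounds memory use and keeps the initial wait limited to the first batch.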

That’s where the project is at right now. I’m still mostly building it for my own reading workflow, but if the result becomes useful enough and the codebase is not too embarrassing, I may open source it later.

submitted by /u/purellmagents