LTX-2.3 based audio model outputs

Reddit r/LocalLLaMA / 4/18/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The post shares example audio outputs generated using an LTX-2.3-based audio model.
  • It includes several character-style voice prompts (e.g., a villain laugh, noir detective, talk show host) demonstrating different speaking styles and emotional tones.
  • The examples focus on how the model renders voice acting qualities such as pacing, laughter, breath, gravelly delivery, and theatrical intensity.
  • The overall takeaway is a practical showcase of multimodal/voice generation capabilities rather than a technical explanation or new release details.

Villain Sinister Laugh
Prompt: A deep-voiced villain speaks with theatrical menace, chuckling softly at first, "Heheheh. Hahahahahahaha! Oh, forgive me, forgive me." He catches his breath with a sinister grin, clears his throat. "It is just SO amusing when they struggle, is it not?" His voice drips with contempt, "I expected more from you, truly I did. How disappointing." He leans in close and whispers with vicious intensity, "But fear not, my dear. The REAL entertainment has only just begun." He chuckles one last time, "Heheheh."

Grizzled Detective (Noir)
Prompt: A grizzled detective speaks in a low, gravelly voice. He takes a long drag of a cigarette and exhales slowly, "This city, it eats people alive, chews them up and spits them out." He coughs, a deep rattling cough, "Heh, these things are going to kill me long before the criminals do." He sighs wearily, "Twenty years I have been on this force. Twenty years of watching good, decent people turn rotten." He chuckles darkly, "You know what the funny thing is? There is nothing funny about any of it, not a damn thing." He clears his throat. "Come on, let us go, we have got work to do."

Talk Show Host (Uncontrollable Laughter)
Prompt: A talk show host speaks with animated enthusiasm. He gasps with exaggerated shock, "No! You did NOT just say that, tell me you did not just say that!" He bursts into uncontrollable laughter, "HAHAHA! Oh my god, oh my god!" He wheezes, barely getting words out, "I cannot, I literally cannot breathe right now!" He wipes his eyes, sniffling, "Oh that is so good, that is really genuinely good." He sighs happily, "Ahhh okay okay, let me compose myself, I am a professional." He takes one breath then immediately cracks up again, "Pfft hehehe, no I absolutely cannot, I am so sorry everybody!" He claps, "Folks, THIS, this right here, is why I love my job!"

Action Hero (Panting Triumph)

Prompt: A muscular man speaks with a thick accent, panting heavily, completely out of breath, "Hah... hah... we made it, we actually made it." He coughs roughly, "Ugh, that was the hardest fight of my entire life, I swear." He groans and clutches his side, "Argh, my ribs, I think something is broken." But then a grin spreads and he laughs heartily despite the pain, "Hahaha! But we WON! Can you believe it? We actually won!" He takes a deep, shuddering breath, "I told you, heh, I told you we would make it. Ahhh, it is finally over."

45 seconds with stable output.
I am experimenting with continuous chunking so it can generate longer clips.
Peak VRAM usage with the Gemma model offloaded is ~8 GB; keeping everything in memory uses around ~21 GB but boosts inference speed significantly.
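The post does not describe how the continuous chunking works, but a common way to stitch fixed-length generations into longer audio is to overlap consecutive chunks and crossfade the overlap. Below is a minimal sketch under that assumption; `generate_chunk`, the sample rate, and the overlap length are all hypothetical stand-ins, not the author's actual pipeline.

```python
# Hypothetical sketch of continuous chunking: produce long audio as
# overlapping fixed-length chunks, then crossfade the overlaps together.
import numpy as np

SR = 24_000      # assumed sample rate
CHUNK_S = 45     # stable single-pass length reported in the post
OVERLAP_S = 2    # assumed overlap used for the crossfade

def generate_chunk(seconds: int, seed: int) -> np.ndarray:
    """Placeholder for a model inference call; returns fake audio."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(seconds * SR).astype(np.float32)

def stitch(chunks: list, overlap: int) -> np.ndarray:
    """Linearly crossfade consecutive chunks over `overlap` samples."""
    out = chunks[0]
    fade = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    for nxt in chunks[1:]:
        mixed = out[-overlap:] * (1.0 - fade) + nxt[:overlap] * fade
        out = np.concatenate([out[:-overlap], mixed, nxt[overlap:]])
    return out

chunks = [generate_chunk(CHUNK_S, seed=i) for i in range(3)]
audio = stitch(chunks, OVERLAP_S * SR)
print(len(audio) / SR)  # total duration: 3*45 - 2*2 = 131 seconds
```

In a real pipeline the next chunk would also be conditioned on the tail of the previous one so the voice stays consistent across the seam; the crossfade only hides the amplitude discontinuity.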

submitted by /u/manmaynakhashi