Phosphene local video and audio generation for Apple Silicon open source (LTX 2.3) [P]

Reddit r/MachineLearning / 5/1/2026


Key Points

  • Phosphene is an open-source desktop app for Apple Silicon Macs that generates video using Lightricks’ LTX 2.3 model via the MLX framework, with one-click installation through Pinokio.
  • A key differentiator is integrated audio generation: LTX 2.3 produces video and audio together in a single forward pass, keeping timing aligned at the frame level for events like footsteps and lip-sync.
  • The tool supports multiple workflows—text-to-video, image-to-video, first/last-frame interpolation, and clip extension with continuous audio—plus local prompt rewriting using a Gemma 3 12B 4-bit encoder.
  • Users can choose between Draft, Standard, and High quality tiers, with High featuring a two-stage TeaCache-accelerated setup that may require an additional on-demand model download.
  • Feature availability and clip length adapt to the user’s Mac RAM (32 GB, 64 GB, 96 GB, and 128+ GB tiers), and everything runs fully offline after the initial model download.

Phosphene is a free desktop panel for generating video on Apple Silicon Macs. It wraps Lightricks' LTX 2.3 model running natively on Apple's MLX framework, and exposes a one-click install through Pinokio.

The differentiator is audio. LTX 2.3 generates video and audio in a single forward pass — they share the same diffusion process, so timing is tied at the frame level. Footsteps land on the correct frame. Lip movement matches dialogue. Ambient sound is conditioned on the visual content. Most other local video models (Wan, Hunyuan, Mochi) generate silent video; you add audio in post.
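
For intuition only, here is a schematic sketch of that coupling (plain NumPy, nothing to do with LTX 2.3's actual code). The point is structural: one denoiser consumes both latent streams on a shared per-frame time axis, so every step updates video and audio together and neither can drift.

```python
# Schematic only -- NOT the real LTX 2.3 architecture. One denoiser sees
# the video and audio latents concatenated along a shared per-frame time
# axis, so each diffusion step updates both streams in lockstep.
import numpy as np

def denoiser(joint, t):
    # stand-in for the real transformer; any shape-preserving map works
    return 0.05 * joint

frames, v_dim, a_dim = 121, 64, 16        # hypothetical latent sizes
video = np.random.randn(frames, v_dim)    # one row per video frame
audio = np.random.randn(frames, a_dim)    # one row per matching audio slot

for t in reversed(range(30)):             # one shared denoising schedule
    joint = np.concatenate([video, audio], axis=-1)
    eps = denoiser(joint, t)
    video -= eps[:, :v_dim]               # split back out per frame:
    audio -= eps[:, v_dim:]               # timing stays frame-aligned
```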

https://preview.redd.it/vutakjb0vgyg1.png?width=1916&format=png&auto=webp&s=bfde8a7f91b861666196158fbf0f2b76d7d7b828

What it can do

Four generation modes:

  • Text → video — describe a scene, get a 5-second clip with synthesized audio
  • Image → video — start from a still, animate from there with synced audio
  • First-frame / Last-frame — provide two images, the model interpolates the middle
  • Extend — append seconds onto an existing clip, audio continuous across the join

Plus prompt rewriting via a local Gemma 3 12B 4-bit text encoder. The same model that reads your prompt for the diffusion stage can also rewrite it in the format LTX 2.3 was trained on. Runs offline, takes a few seconds.
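
If you want the same trick outside the panel, a minimal sketch with the mlx-lm library looks like the following. The checkpoint id and the rewrite instruction are assumptions for illustration, not Phosphene's actual code:

```python
# Minimal prompt-rewrite sketch using mlx-lm. The repo id and the
# instruction below are assumptions, not what Phosphene actually ships.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-3-12b-it-4bit")  # assumed id

user_prompt = "Wizard in forest"
messages = [{
    "role": "user",
    "content": "Rewrite this as one detailed video-generation prompt, "
               f"including explicit audio cues: {user_prompt}",
}]
# mirror the documented mlx-lm chat pattern
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```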

https://preview.redd.it/3irbyie5vgyg1.jpg?width=1920&format=pjpg&auto=webp&s=bb03a0c8e64899a83af7980847e61e28b75397ca

Quality tiers

Three quality levels, picked per-job:

  • Draft — half resolution, ~2 minutes. For iterating on prompts.
  • Standard — full 1280×704, 7 minutes. The daily driver. Q4 distilled (25 GB on disk).
  • High — Q8 two-stage with TeaCache acceleration (sketched after this list), ~12 minutes. Adds ~25 GB. Optional download — a button in the panel pulls it on demand. Required for first-frame/last-frame (FFLF).
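
For the curious, the TeaCache idea is roughly: measure how much the timestep-conditioned input drifts between diffusion steps, and reuse the cached model output until the accumulated drift crosses a threshold. Below is a simplified sketch; the real method works on the diffusion transformer's modulated inputs with a calibrated rescaling, so treat this as the shape of the idea, not the algorithm.

```python
# Simplified TeaCache-style step skipping: track drift in the timestep-
# conditioned input and reuse the cached model output until the
# accumulated drift crosses a threshold.
import numpy as np

def modulate(x, t):
    return x * (1.0 + 0.02 * t)          # stand-in for timestep conditioning

def rel_change(a, b):
    return float(np.abs(a - b).mean() / (np.abs(b).mean() + 1e-8))

def denoise_with_teacache(model, latents, steps=30, threshold=0.15):
    cached, last_inp, drift, full = None, None, 0.0, 0
    for t in reversed(range(steps)):
        inp = modulate(latents, t)
        if last_inp is not None:
            drift += rel_change(inp, last_inp)
        if cached is None or drift >= threshold:
            cached = model(inp)          # the expensive transformer pass
            drift, full = 0.0, full + 1
        latents = latents - 0.05 * cached   # cheap update from the cache
        last_inp = inp
    print(f"{full}/{steps} full passes")    # the rest reused the cache
    return latents

denoise_with_teacache(lambda z: z, np.random.randn(16, 64))
```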

Hardware compatibility

Apple Silicon only. The panel detects your Mac's RAM at launch and gates features accordingly:

  • 32 GB → Compact: lower resolution, shorter clips
  • 64 GB → Comfortable: full 1280×704 baseline
  • 96 GB → High: longer clips, full Q8
  • 128+ GB → Pro: no clamps

This is enforced because LTX 2.3's working tensor footprint is real — there is no way to run a full 1280×704, 5-second generation in less than ~30 GB of resident memory. The tier system is honest about that rather than letting users queue jobs that end in the OOM killer.
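
A minimal sketch of what gating like this can look like on macOS, using the hw.memsize sysctl. The cutoffs mirror the list above, but the per-tier clip lengths and resolutions are illustrative guesses, not Phosphene's actual numbers:

```python
# RAM-tier gating sketch for macOS. Cutoffs follow the tiers above; the
# per-tier clip seconds and resolutions are illustrative assumptions.
import subprocess

TIERS = [  # (min GB, tier name, max clip seconds, resolution)
    (128, "Pro",         10, (1280, 704)),
    (96,  "High",        10, (1280, 704)),
    (64,  "Comfortable",  5, (1280, 704)),
    (32,  "Compact",      3, (960, 528)),
]

def detect_tier():
    out = subprocess.check_output(["sysctl", "-n", "hw.memsize"])
    gb = int(out.strip()) / 2**30            # physical RAM in GiB
    for min_gb, name, max_secs, res in TIERS:
        if gb >= min_gb:
            return name, max_secs, res
    raise SystemExit(f"{gb:.0f} GB is below the 32 GB minimum")

print(detect_tier())
```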

Intel Macs and other platforms are not supported. There is no port path for them — MLX is Apple-only by design.

Audio behavior

Audio quality is conditioned on the prompt. A visual-only prompt produces faint ambient sound, which can read as "near-silent." A prompt with explicit audio cues produces layered foreground sound.

Compare:

  • "Wizard in forest" → quiet room tone
  • "Wizard in forest, low whispered chant, ember crackle, distant owl hoot" → audible chant + crackle + owl, all timed to the visuals

This is documented behavior of LTX 2.3, not a Phosphene quirk. Describe the soundscape in your prompt the same way you describe the visual.

How it differs from existing tools

Compared to other locally-runnable video models on a Mac:

  • vs. ComfyUI workflows — ComfyUI runs LTX 2.3 too, but in a node graph you have to assemble per job. Phosphene is a fixed panel: prompt, mode, dimensions, generate. No graph maintenance.
  • vs. native PyTorch builds (Wan, Mochi, Hunyuan) — those run on torch via MPS, which is a compatibility shim, not native Metal. MLX runs the model directly in Apple's compute framework. The result is meaningful speed and memory differences on the same hardware.
  • vs. cloud / API services (Pika, Runway) — those generate faster on H100s but require accounts, queue time, monthly subscriptions, and upload of source images. Phosphene runs with no network beyond the initial weight download.
  • vs. silent local video models — joint audio synthesis is, at the time of writing, unique to LTX 2.3 among models with usable Mac runtimes.

Output format

Lossless H.264 by default — yuv444p, CRF 0 — so your archive is the highest fidelity the renderer can produce. Web/social platforms will re-encode anyway. Override via env variables (LTX_OUTPUT_PIX_FMT, LTX_OUTPUT_CRF) if you want yuv420p directly.

The +faststart movflag is set, so the moov atom sits at the front of the file. Gallery thumbnails and streaming players can decode the first frame immediately, without fetching the whole clip.
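
If you skip the env overrides and keep the lossless archive, one way to produce a web-friendly copy is a plain ffmpeg re-encode (assuming ffmpeg is on your PATH; the filenames are placeholders):

```python
# Re-encode the lossless archive into a share-friendly file. Filenames
# are placeholders; every ffmpeg flag here is standard.
import subprocess

src, dst = "clip_lossless.mp4", "clip_web.mp4"
subprocess.run([
    "ffmpeg", "-i", src,
    "-c:v", "libx264",
    "-crf", "18",                # near-transparent, far smaller than CRF 0
    "-pix_fmt", "yuv420p",       # what most players and platforms expect
    "-movflags", "+faststart",   # keep the moov atom at the front
    "-c:a", "copy",              # pass the synthesized audio through
    dst,
], check=True)
```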

Install

Search Phosphene in Pinokio's Discover tab and click Install. Pinokio handles the venv, Python 3.11 pin, MLX pipeline install, codec patches, and ~31 GB of model downloads (Q4 LTX 2.3 + Gemma text encoder). Resumable — if a download is interrupted, hitting Install again picks up where it left off.

Optional: run "hf auth login" in Terminal first to authenticate the Hugging Face downloads. Anonymous downloads are throttled; authenticated downloads are roughly 10× faster, which matters for the optional 25 GB Q8 model.

[ATTACH VIDEO: phosphene_hero_x.mp4]

License + credits

Phosphene panel: MIT.
LTX 2.3 weights: Lightricks' own license — read it before commercial use.
MLX framework: Apache 2.0 (Apple).
Gemma weights: Google's terms.

Built on:

  • LTX 2.3 model — Lightricks
  • MLX port (ltx-2-mlx) — u/dgrauet
  • MLX framework — Apple ML
  • Pinokio runtime — u/cocktailpeanut

Source: github.com/mrbizarro/phosphene. Issues and PRs welcome.

submitted by /u/Opening-Ad5541