aMuseMe: When Small Models Compose a Visual Symphony

Dev.to / 6/16/2026

💬 Opinion / Developer Stack & Infrastructure / Tools & Practical Usage / Models & Research

Key Points

  • aMuseMe generates a complete, stylized lyric video automatically from an input audio file, avoiding manual keyframing and editing work.
  • The system is a four-stage pipeline (word-level listening and timestamping, lyrics-to-display-line layout, background illustration, and 30 fps HD rendering) that runs entirely locally within a 32B-parameter budget.
  • The first stage (“Listener”) relies on faster-whisper (Whisper large-v3, ~1.55B) to extract precise word-level timestamps so each word highlights at the exact sung moment.
  • Accurate timestamps on music, as opposed to clean speech, required tuning: condition_on_previous_text is used to improve accuracy, and voice activity detection (VAD) with aggressive thresholds keeps the model from hallucinating lyrics during instrumental breaks.
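
The Listener stage and the per-frame word highlight described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the Word class, the helper names, and the VAD values (threshold 0.6, 1000 ms minimum silence) are assumptions, and setting condition_on_previous_text to True is a guess based on the summary saying the option is "used". The transcribe keyword arguments themselves are part of faster-whisper's WhisperModel.transcribe.

```python
from bisect import bisect_right
from dataclasses import dataclass

FPS = 30  # render frame rate named in the summary


@dataclass(frozen=True)
class Word:
    """Hypothetical container for one sung word and its time span in seconds."""
    text: str
    start: float
    end: float


def listener_options(vad_threshold=0.6, min_silence_ms=1000):
    """Build keyword arguments for faster-whisper's transcribe().

    The numeric values are illustrative assumptions, not the article's settings.
    """
    return {
        "word_timestamps": True,             # per-word start/end times
        "condition_on_previous_text": True,  # assumed enabled, per the summary
        "vad_filter": True,                  # skip non-vocal regions
        "vad_parameters": {
            "threshold": vad_threshold,
            "min_silence_duration_ms": min_silence_ms,
        },
    }


def transcribe_words(audio_path, model_size="large-v3"):
    """Run the Listener stage and return a flat list of Word objects."""
    # Imported lazily so the rest of the sketch works without the library.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path, **listener_options())
    return [
        Word(w.word.strip(), w.start, w.end)
        for seg in segments
        for w in (seg.words or [])
    ]


def active_word_index(words, frame, fps=FPS):
    """Return the index of the word being sung at a given frame, or None."""
    t = frame / fps
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None  # before the first word, or in a gap between words
```

A renderer could call active_word_index once per frame to decide which word to highlight. Returning None in gaps leaves nothing highlighted during instrumental breaks, which is one possible design choice among several.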
