Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations
arXiv cs.LG · March 30, 2026
Key Points
- The paper introduces a new dataset of 24 Excalidraw demonstrations paired with narrated audio, covering 8 STEM domains, with millisecond-precision timestamps for every drawing element.
- It evaluates a LoRA-fine-tuned vision-language model (Qwen2-VL-7B) that generates structured stroke sequences synchronized to speech, trained on only the small demonstration set.
- Topic-stratified five-fold experiments show that conditioning on timestamps substantially improves temporal alignment versus ablated baselines.
- The model demonstrates cross-topic generalization to unseen STEM subjects, suggesting transferability beyond the training domains.
- The authors discuss how the approach could extend to real classroom production workflows and release the dataset and code for further research.
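To make the setup concrete, here is a minimal sketch of what a speech-synchronized drawing record and a temporal-alignment score might look like. The field names (`element`, `start_ms`, etc.) and the onset-error metric are illustrative assumptions, not the paper's actual schema or evaluation protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StrokeEvent:
    """One drawing element with millisecond-precision timing.
    Field names are hypothetical, not the dataset's real schema."""
    element: str                        # e.g. "rectangle", "arrow", "freedraw"
    points: List[Tuple[float, float]]   # canvas coordinates
    start_ms: int                       # onset relative to the narration audio
    end_ms: int                         # offset relative to the narration audio

def mean_onset_error_ms(pred: List[StrokeEvent], ref: List[StrokeEvent]) -> float:
    """Mean absolute difference between predicted and reference onsets --
    one plausible way to score speech-to-drawing temporal alignment."""
    assert len(pred) == len(ref) and ref, "sequences must match and be non-empty"
    return sum(abs(p.start_ms - r.start_ms) for p, r in zip(pred, ref)) / len(ref)

# Example: a prediction 150 ms late on a single arrow stroke.
ref = [StrokeEvent("arrow", [(0.0, 0.0), (10.0, 10.0)], 1200, 1800)]
pred = [StrokeEvent("arrow", [(0.0, 0.0), (10.0, 10.0)], 1350, 1950)]
print(mean_onset_error_ms(pred, ref))  # 150.0
```

A timestamp-conditioned model would emit records like these directly; the ablated baselines in the paper presumably drop the timing fields and recover alignment less well.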