Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation
arXiv cs.LG, 2026-03-24
Key Points
- The study systematically compares LSTM and Transformer architectures for symbolic music generation (SMG) by evaluating how each models local melodic continuity versus global structural coherence.
- On the Deutschl dataset, LSTMs are found to capture local patterns better but struggle to preserve long-range dependencies, while Transformers maintain global structure but often generate irregular phrasing.
- The authors propose a Hybrid model (Transformer Encoder + LSTM Decoder) designed to combine the LSTM's strength in local consistency with the Transformer's strength in global coherence (see the architecture sketch after this list).
- In an evaluation of 1,000 generated melodies per architecture across 17 quality metrics, the hybrid approach outperforms both baselines on local melodic continuity and global structural coherence (one such metric is sketched after the architecture example).
- Ablation studies and human perceptual evaluations further support the quantitative findings, validating the rationale behind the architectural fusion.
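
Below is a minimal PyTorch sketch of the hybrid design described above, assuming a standard encoder-decoder layout: a Transformer encoder produces globally contextualized representations, and an LSTM decoder adds recurrent state for local continuity. All module names, hyperparameters, and the vocabulary size are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the hybrid idea: a Transformer encoder summarizes global
# context, and an LSTM decoder generates tokens with strong local continuity.
# Hyperparameters and module layout are illustrative, not the paper's setup.
import torch
import torch.nn as nn

class HybridMelodyModel(nn.Module):
    def __init__(self, vocab_size=128, d_model=256, nhead=4,
                 num_encoder_layers=3, lstm_hidden=256, lstm_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_encoder_layers)
        # The LSTM decoder consumes the encoder's contextualized embeddings;
        # its recurrence supplies the local smoothness Transformers can lack.
        self.decoder = nn.LSTM(d_model, lstm_hidden, num_layers=lstm_layers,
                               batch_first=True)
        self.proj = nn.Linear(lstm_hidden, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)           # (batch, seq, d_model)
        context = self.encoder(x)        # global structure via self-attention
        out, _ = self.decoder(context)   # local continuity via recurrence
        return self.proj(out)            # next-token logits

# Usage: next-token logits for a batch of symbolic-music token sequences.
model = HybridMelodyModel()
tokens = torch.randint(0, 128, (2, 64))  # two sequences of 64 tokens
logits = model(tokens)                   # shape (2, 64, 128)
```

For actual autoregressive generation one would add a causal attention mask and decode token by token; the sketch only shows the teacher-forced forward pass.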
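
The 17 quality metrics are not enumerated in this summary. As an illustration of the kind of objective metric commonly used in symbolic music evaluation, here is pitch-class entropy, a measure of tonal focus; this specific choice is an assumption, not confirmed from the paper.

```python
# Illustrative example of one common symbolic-music quality metric:
# pitch-class entropy. This metric is an assumed example of the kind of
# measure used, not one confirmed to be among the paper's 17 metrics.
import math
from collections import Counter

def pitch_class_entropy(midi_pitches):
    """Shannon entropy (bits) over the 12 pitch classes of a melody.
    Lower values indicate a more tonally focused melody."""
    counts = Counter(p % 12 for p in midi_pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

melody = [60, 62, 64, 65, 67, 69, 71, 72]  # C-major scale as MIDI note numbers
print(round(pitch_class_entropy(melody), 3))  # 2.75 bits (C appears twice)
```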