ShishuLM: Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models

arXiv cs.CL / 4/1/2026


Key Points

  • The paper introduces ShishuLM, an efficient language model architecture that reduces Transformer compute by replacing full decoder layers at the top of the model with MLP-only blocks.
  • It reports improved performance characteristics, including 10–60% lower generation latency and 1.3–5× higher throughput compared with standard attention-heavy models.
  • The authors further propose parameter sharing across adjacent MLP-only layers, achieving up to 20% memory savings with minimal performance degradation.
  • The work is motivated by observed architectural redundancies in the attention sub-layers of the upper layers and by prior research on inference-time layer pruning and depth-dependent computation.
  • Overall, it provides guidance for building more efficient pre-training-time model architectures by leveraging how information flows through Transformer layers.
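The core architectural idea above — top decoder layers stripped down to MLP-only blocks, with adjacent blocks sharing parameters — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, layer counts, and the plain ReLU feed-forward are assumptions for clarity.

```python
import numpy as np

def mlp_block(x, W1, b1, W2, b2):
    """Position-wise feed-forward block with a residual connection
    (no attention sub-layer, the key simplification in ShishuLM's top layers)."""
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU; the actual activation is an assumption
    return x + h @ W2 + b2

# Hypothetical config for illustration only
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)

# Parameter sharing: adjacent MLP-only blocks reuse one parameter set,
# which is how the reported memory savings arise.
shared = (rng.standard_normal((d_model, d_ff)) * 0.02,
          np.zeros(d_ff),
          rng.standard_normal((d_ff, d_model)) * 0.02,
          np.zeros(d_model))

x = rng.standard_normal((4, d_model))  # (seq_len, d_model) hidden states from lower layers
for _ in range(2):                     # two adjacent MLP-only blocks share `shared`
    x = mlp_block(x, *shared)

print(x.shape)  # (4, 8)
```

Because the two blocks index the same `shared` tuple, only one set of feed-forward weights is stored, halving the parameter count of that pair relative to unshared blocks.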

Abstract

While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, particularly in the attention sub-layers in the top layers, presenting opportunities for optimization without compromising performance. Taking insights from research on inference-time layer pruning and depth-dependent computation in language models, we introduce an efficient language model architecture referred to as ShishuLM. By replacing full decoder layers at the top of the model with MLP-only blocks, we achieve up to a 10–60% improvement in generation latency and a 1.3–5× gain in throughput. Upon further sharing parameters across adjacent MLP-only layers of ShishuLM, we obtain up to 20% savings in memory with minimal degradation in performance. Our findings provide insights towards building more efficient language modeling architectures from a pre-training standpoint by leveraging how information flows in transformers.