SimDiff: Depth Pruning via Similarity and Difference

arXiv cs.AI / 4/22/2026


Key Points

  • The paper introduces SimDiff, a new depth-pruning criterion for improving the inference efficiency of large language models by removing redundant layers.
  • Unlike prior approaches that rely mainly on layer-to-layer cosine similarity, SimDiff evaluates layers using two orthogonal, complementary signals: representational similarity and transformation difference.
  • It quantifies transformation difference with two metrics—MSSD (outlier-sensitive, emphasizing decisive corrections) and MASD (robust average contribution)—to avoid unpredictable or even catastrophic failures seen with single-heuristic methods.
  • Experiments across multiple models (0.5B–13B parameters) show SimDiff outperforms existing baselines across different pruning ratios, preserving over 91% of LLaMA2-7B performance at 25% pruning and enabling up to 1.49× inference speedup for LLaMA3.1-8B.
  • The authors report that heavily pruned models can be recovered effectively with minimal fine-tuning, suggesting practical deployability beyond one-shot pruning.
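The three per-layer signals above can be sketched concretely. This is a minimal illustration, not the paper's implementation: the acronyms MSSD and MASD are not expanded in the summary, so the definitions below (mean squared / mean absolute state difference over a layer's residual update) are assumptions chosen to match the stated behavior, i.e. squared differences amplify outlier "decisive corrections" while absolute differences give a robust average contribution.

```python
import numpy as np

def layer_signals(h_in: np.ndarray, h_out: np.ndarray):
    """Compute hedged stand-ins for SimDiff's per-layer signals.

    h_in, h_out: (tokens, hidden) hidden states entering and leaving one
    transformer layer. Returns (mean cosine similarity, MSSD, MASD),
    where MSSD/MASD are *assumed* to be the mean squared / mean absolute
    state difference; the paper may define them differently.
    """
    # Representational similarity: per-token cosine between input and output.
    cos = np.sum(h_in * h_out, axis=-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + 1e-8
    )
    diff = h_out - h_in  # the layer's residual update
    mssd = float(np.mean(diff ** 2))     # squaring emphasizes outlier corrections
    masd = float(np.mean(np.abs(diff)))  # robust average contribution
    return float(cos.mean()), mssd, masd
```

A layer whose output is nearly identical to its input scores high on similarity and near zero on both difference metrics, making it a pruning candidate under either signal.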

Abstract

Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
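As a rough illustration of the one-shot pruning step the abstract describes, the sketch below ranks layers by an importance score and drops the least important fraction. The combination rule (low importance when a layer's output closely resembles its input and its transformation difference is small) is an assumption for illustration; the paper's actual criterion that fuses similarity with MSSD/MASD may differ.

```python
def prune_layers(similarity, difference, prune_ratio):
    """Return the ids of layers kept after one-shot depth pruning.

    similarity[i]: input/output representational similarity of layer i.
    difference[i]: transformation-difference signal of layer i.
    The scoring rule here is a hypothetical stand-in for SimDiff.
    """
    n = len(similarity)
    n_prune = int(n * prune_ratio)
    # Assumed importance: layers that change their input a lot
    # (low similarity, large difference) are considered important.
    scores = [(1.0 - s) + d for s, d in zip(similarity, difference)]
    order = sorted(range(n), key=lambda i: scores[i])  # ascending importance
    pruned = set(order[:n_prune])
    return [i for i in range(n) if i not in pruned]
```

At a 25% ratio on a 32-layer model, this removes the 8 layers whose combined signal marks them as most redundant, leaving the remaining stack in its original order.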