AI Navigate

I spent a weekend doing layer surgery on 6 different model architectures. There's a "danger zone" at 50% depth that kills every one of them.

Reddit r/LocalLLaMA / 3/17/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Duplicating transformer layers across five architectures revealed a universal "danger zone" around 50-56% depth that degrades performance regardless of architecture.
  • Optimal duplication depth varies by model type: in the Hybrid 9B case, duplicating layers at 75-84% depth scored 7/10 (+3 over baseline), while duplicating at 56-65% depth dropped the score to 2/10, exposing a dangerous mid-range.
  • Cross-model layer transplant is not viable: simply matching dimensions is insufficient to preserve capabilities across architectures.
  • The work was done locally on Apple Silicon (M3 Ultra, 512GB) with MLX, not training or cloud access, and suggests a minimum viable model of around 3B parameters.

TL;DR: Duplicated transformer layers in 5 model architectures (Dense 32B, Hybrid 9B, MoE 30B, Dense 3B, cross-model transplant 7B). Found a universal "danger zone" at ~50-56% depth that kills models regardless of architecture. Optimal duplication depth varies by type. Cross-model layer transplant is a hard no — matching dimensions isn't enough. Minimum viable model: ~3B.

All local on Apple Silicon (M3 Ultra, 512GB) via MLX. No cloud, no API, no training — just surgery and automated benchmarks.


Background

David Noel Ng published a technique for duplicating transformer layers to boost capabilities without retraining (original post). The idea: if a layer block handles "reasoning," giving the model a second pass through that circuit should help it think harder. Like re-reading a paragraph before answering.

I wanted to map where the functional circuits actually live, whether it generalizes across architectures, and what breaks when you push it.
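The core operation is simple enough to sketch in plain Python. The list below stands in for a model's layer list (in MLX this would be something like `model.model.layers`; that attribute path is an assumption, and real entries are modules, not ints):

```python
def duplicate_block(layers, start, end):
    """Return a new layer list with the inclusive block [start, end]
    appearing twice in a row. The copies are shared references, so the
    second pass reuses the same weights: no retraining needed."""
    block = layers[start:end + 1]
    return layers[:end + 1] + block + layers[end + 1:]

# Hybrid 9B example: duplicating L24-27 turns 32 layers into 36
layers = list(range(32))                  # ints standing in for transformer blocks
patched = duplicate_block(layers, 24, 27)
assert len(patched) == 36
assert patched[24:28] == patched[28:32]   # same circuit, run twice
```

Because the copies share weights, the only added cost is forward compute, which matches the small per-layer speed penalty reported in the Methodology section.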

Phase 1-3: Dense 32B (Qwen2.5-Coder-32B, 64 layers)

Mapped 5 functional circuits at different depths:

  • L28-34 (44-53%) — "structural reasoning": different coding style. True O(1) implementations, reversed data structure polarity, underflow detection others miss.
  • L36-42 (56-65%) — "verification circuit": writes the best test suites but introduces bugs in helper code. The builder and the checker are literally different circuits.

Result: 10/10 vs 10/10 tie. Model was too strong to benefit. Layer duplication changed how it codes, not what it can solve. Important: this means you can't improve a model that already aces your benchmark.

Phase 4: Hybrid 9B (Qwen3.5-9B-abliterated, 32 layers, linear attention)

This model was weak enough to fail (4/10 baseline). Now we can measure actual capability change.

| Position | Depth | Score | Delta |
|----------|-------|-------|-------|
| L4-7 | 13-22% | 4/10 | 0 |
| L8-11 | 25-34% | 5/10 | +1 |
| L12-15 | 38-47% | 4/10 | 0 |
| L18-21 | 56-65% | 2/10 | -2 (DANGER ZONE) |
| L24-27 | 75-84% | 7/10 | +3 (WINNER) |

L24-27: 75% capability improvement. Three new problems solved (three_sum, word_break, longest_prefix), nothing lost from original. The "one more chance to think" hypothesis confirmed.

L18-21: actively destroys capability when doubled. These layers are attention routing — a valve that must flow at exactly the right rate.
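The depth percentages quoted throughout appear to be layer_index / total_layers, truncated to whole percent; a tiny helper (my reconstruction of the convention, not something the author states) reproduces the bands in the table above:

```python
def depth_pct(start, end, total):
    """Map an inclusive layer range to its depth band in whole percent."""
    return int(100 * start / total), int(100 * end / total)

# Hybrid 9B (32 layers): matches the table
assert depth_pct(24, 27, 32) == (75, 84)   # winner band
assert depth_pct(18, 21, 32) == (56, 65)   # danger zone
# Dense 32B (64 layers): matches the verification-circuit band
assert depth_pct(36, 42, 64) == (56, 65)
```

A few rows elsewhere (e.g. the MoE winner listed as 38-44%) look rounded rather than truncated, so treat the helper as approximate.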

Phase 5: Surgery Experiments on 9B

What if we get creative?

| Experiment | Score | What happened |
|------------|-------|---------------|
| Double-stack (two good circuits) | 3/10 | Circuits interfere, not compound |
| Triple-stack (3x best block) | 1/10 | Sharp cliff — barely produces Python |
| Forbidden Cut (delete danger zone + boost reasoning) | 0/10 | Total brain death |

The danger zone is load-bearing. Delete it = output dies. Duplicate it = reasoning dies. Must exist exactly once. The model is less modular than you'd hope.

The triple-stack finding is important: there's no "think harder by thinking more." One extra pass = +75%. Two extra passes = garbage. Binary threshold.
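Mechanically, the stack variants are the same splice with a different repeat count. A sketch (stand-in layer list, shared-weight copies, hypothetical helper name):

```python
def stack_block(layers, start, end, copies=1):
    """Insert `copies` extra shared-weight passes through the inclusive
    block [start, end], immediately after the original block."""
    block = layers[start:end + 1]
    return layers[:end + 1] + block * copies + layers[end + 1:]

layers = list(range(32))                 # stand-ins for transformer blocks
single = stack_block(layers, 24, 27)     # the +3 winner: 36 layers
triple = stack_block(layers, 24, 27, 2)  # "triple-stack": three total passes
assert len(single) == 36 and len(triple) == 40
```

The double-stack corresponds to applying the operation once per circuit; note that the second block's indices shift after the first insertion.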

Phase 6: MoE 30B (Qwen3-30B-A3B, 48 layers, 256 experts, top-8)

The 75-84% depth rule was WRONG for MoE.

Winner: L18-21 at 38-44% depth (14/15, +1 over 13/15 baseline). The "reasoning core" in MoE models sits earlier — routing gates create implicit depth through expert selection.

Additional MoE experiments:

| Experiment | Score | Finding |
|------------|-------|---------|
| 1 layer duplicated | 11/15 (-2) | Minimum 4 layers to help |
| 2 layers duplicated | 12/15 (-1) | Still below threshold |
| 4 layers duplicated | 14/15 (+1) | Minimum effective dose |
| 12 experts (up from 8) | 13/15 (0) | Neutral |
| 16 experts | 10/15 (-3) | Wrong experts drown signal |
| 24 experts | 8/15 (-5) | Catastrophic |
| Layer dup + wider experts | 13/15 (0) | Cancel each other out |

Dormant experts exist for a reason. Forcing them to vote is like asking everyone in a meeting to speak instead of the 8 who know the topic.
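The dilution effect shows up even in a toy softmax gate: widening top-k renormalizes weight away from the few confident experts toward the long tail. This is a generic top-k gating sketch, not Qwen3's actual router:

```python
import math

def topk_gate(logits, k):
    """Softmax gate restricted to the top-k experts, renormalized.
    Returns {expert_index: weight}, weights summing to 1."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# Toy gate: three confident experts, many near-zero "dormant" ones
logits = [4.0, 3.5, 3.0] + [0.1] * 13
w8, w16 = topk_gate(logits, 8), topk_gate(logits, 16)
# Widening k shifts mass from the experts who know the topic to the tail
assert w16[0] < w8[0]
```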

One interesting anomaly: valid_parens (bracket matching) was ALWAYS failed by the baseline and ALL layer-dup variants. But EVERY expert-width variant passed it. The capability exists in dormant experts — it just never gets selected by top-8 routing. Fascinating but not actionable since wider routing destroys harder problems.

Phase 7: Minimum Viable Model Size

| Model | Params | Baseline | Best Variant | Delta |
|-------|--------|----------|--------------|-------|
| Qwen2.5-0.5B | 0.5B | 2/15 | 2/15 | 0 |
| Qwen2.5-1.5B | 1.5B | ~4/15 | ~4/15 | 0 |
| Qwen2.5-3B | 3B | 8/15 | 9/15 | +1 |

Head-to-head on 3B: Original 8/15 vs Frankenstein 9/15. Gained regex_match and median_sorted, lost group_anagrams. Speed penalty: -7.6% (127 vs 117 tok/s).

Minimum viable model: ~3B parameters. Below that, there aren't enough functional circuits to have spare reasoning capacity worth duplicating.

Phase 8: Cross-Model Layer Transplant (the big swing)

The dream: take math reasoning layers from Qwen2.5-Math-7B and graft them into Qwen2.5-7B-Instruct. Both models share identical hidden dimensions (H=3584, heads=28, kv_heads=4, intermediate=18944). Perfect dimensional compatibility.
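The compatibility screen amounts to comparing a few config fields. The field names below follow the Hugging Face `Qwen2Config` convention (an assumption on my part; the post only gives the values):

```python
COMPAT_KEYS = ("hidden_size", "num_attention_heads",
               "num_key_value_heads", "intermediate_size")

def transplant_compatible(host_cfg, donor_cfg):
    """Necessary (but, per the results below, NOT sufficient) condition
    for swapping layers between models: matching tensor shapes."""
    return all(host_cfg[k] == donor_cfg[k] for k in COMPAT_KEYS)

host = dict(hidden_size=3584, num_attention_heads=28,
            num_key_value_heads=4, intermediate_size=18944)
donor = dict(host)                  # Math-7B shares every dimension
assert transplant_compatible(host, donor)
```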

| Variant | Code (of 15) | Math (of 5) | Verdict |
|---------|--------------|-------------|---------|
| Host (General-7B) | 14 | 4 | Baseline |
| Donor (Math-7B) | 3 | 4 | Baseline |
| L8-11 replace (29-39%) | 3 | 1 | Catastrophic |
| L8-11 insert (29-39%) | 7 | 4 | Half coding gone |
| L14-17 replace (50-61%) | 0 | 0 | Lobotomy |
| L14-17 insert (50-61%) | 0 | 0 | Lobotomy |
| L20-23 replace (71-82%) | 0 | 0 | Lobotomy |
| L20-23 insert (71-82%) | 0 | 0 | Lobotomy |

Cross-model transplant is a hard no. 6 of 6 variants either destroyed the model or severely degraded it. The only survivor (L8-11 insert) just added foreign layers early enough that the host routed around them — it didn't absorb math capability.

Key insight: Matching tensor dimensions is necessary but not sufficient. Layers develop model-specific internal representations during training. Swapping layers between models is like transplanting a paragraph from one book into another — same language, same page size, completely wrong context.

This confirms that frankenmerge works by duplicating a model's own circuits (letting it think twice through its own logic), not by transplanting foreign capabilities.

The Universal Danger Zone

Replicated across ALL 5 architectures tested:

| Architecture | Layers | Danger Zone | Depth % |
|--------------|--------|-------------|---------|
| Dense 32B | 64 | L36-42 | 56-65% |
| Hybrid 9B | 32 | L18-21 | 56-65% |
| MoE 30B | 48 | L24-27 | 50-56% |
| Dense 3B | 36 | L18-20 | 50-56% |
| Transplant 7B | 28 | L14-17 | 50-61% |

These layers are the model's attention routing infrastructure. They're not a "circuit" you can duplicate or swap — they're the wiring between circuits. Mess with the wiring, everything downstream breaks.
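A practical consequence is a guard that rejects duplication blocks touching the zone. The 50-65% band below is the union of the observed zones across the five architectures (my reading of the table, not a rule the author states):

```python
DANGER_LO, DANGER_HI = 0.50, 0.65   # union of observed danger bands

def in_danger_zone(start, end, total):
    """True if the inclusive layer block [start, end] overlaps ~50-65% depth."""
    return start / total < DANGER_HI and end / total > DANGER_LO

assert in_danger_zone(18, 21, 32)        # Hybrid 9B danger block
assert not in_danger_zone(24, 27, 32)    # Hybrid 9B winner block
```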

Optimal Duplication Depth by Architecture

| Type | Optimal Depth | Reasoning |
|------|---------------|-----------|
| Dense (32B) | 44-53% | Structural reasoning mid-stack |
| Hybrid linear (9B) | 75-84% | Reasoning lives late in linear attention |
| MoE (30B) | 38-44% | Expert routing pushes reasoning earlier |
| Dense (3B) | 28-36% | Smaller models reason earlier |

Practical Guide for Local Builders

  1. Benchmark your model first. If it already passes everything, frankenmerge can't help (Phase 3).
  2. Start with 4 layers at ~75% depth for dense, ~40% for MoE.
  3. One block, one copy. Every attempt to do more made things worse.
  4. Models under 3B: don't bother. Not enough circuit depth.
  5. If your variant outputs SyntaxErrors or gibberish, you hit the danger zone. Move your duplication point.
  6. Don't transplant between models. Duplication only. Same model, same layers, one extra copy.
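Steps 2-4 of the guide can be turned into a starting-point calculator. The depth targets are the author's; the function is my convenience sketch, and the coarse ~40% MoE target lands one layer later than the actual L18-21 winner on the 30B, so treat the output as a first guess:

```python
def starting_block(total_layers, arch, block_size=4):
    """Suggest an inclusive layer range to try duplicating first.
    Depth targets from the guide: ~75% for dense, ~40% for MoE."""
    depth = {"dense": 0.75, "moe": 0.40}[arch]
    start = int(total_layers * depth)
    return start, start + block_size - 1

assert starting_block(32, "dense") == (24, 27)   # reproduces the Hybrid 9B winner
```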

Methodology

All benchmarks: 15 LeetCode-style problems, 3 tiers (Standard/Medium/Hard). Code generated by the model, extracted, executed against hidden test cases. PASS = code actually runs and produces correct output. No LLM-as-judge, no vibes-based scoring.
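That pass/fail scheme reduces to a small harness: execute the generated code, call the target function against hidden (args, expected) pairs, and count any exception as a failure. A minimal sketch (function and variable names are mine, not the author's scripts):

```python
def run_candidate(code, func_name, hidden_tests):
    """Execute model-generated code and score it strictly pass/fail.
    hidden_tests: list of (args, expected) pairs. Any exception = FAIL."""
    ns = {}
    try:
        exec(code, ns)                       # run in an isolated namespace
        fn = ns[func_name]
        return all(fn(*args) == expected for args, expected in hidden_tests)
    except Exception:
        return False

generated = ("def two_sum(nums, t):\n"
             "    seen = {}\n"
             "    for i, n in enumerate(nums):\n"
             "        if t - n in seen: return [seen[t - n], i]\n"
             "        seen[n] = i\n")
assert run_candidate(generated, "two_sum", [(([2, 7, 11], 9), [0, 1])])
assert not run_candidate("def two_sum(a, b): return None", "two_sum",
                         [(([2, 7], 9), [0, 1])])
```

In real use, model output should run in a subprocess with a timeout rather than a bare `exec`; a danger-zone variant that emits gibberish can still hang or do arbitrary things.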

~8% speed penalty per 4 duplicated layers (7 extra layers on 64-layer model = -9%, 4 extra on 36-layer = -7.6%).

Full lab notebook and all scripts available on request.

What's Next

  • Block size sweep: is 4 layers optimal or just the first size that works?
  • LoRA on duplicated layers: can fine-tuning sharpen the extra pass?
  • Repeat runs (3x minimum) for variance analysis
  • Test on Llama, Mistral, Phi architectures

Drew Smith — Rocktalk Research · Letting the Rocks Cry Out

submitted by /u/Low_Ground5234