TL;DR: Duplicated transformer layers in 5 model architectures (Dense 32B, Hybrid 9B, MoE 30B, Dense 3B, cross-model transplant 7B). Found a universal "danger zone" at ~50-65% depth that kills models regardless of architecture. Optimal duplication depth varies by type. Cross-model layer transplant is a hard no: matching dimensions isn't enough. Minimum viable model: ~3B.
All local on Apple Silicon (M3 Ultra, 512GB) via MLX. No cloud, no API, no training — just surgery and automated benchmarks.
Background
David Noel Ng published a technique for duplicating transformer layers to boost capabilities without retraining (original post). The idea: if a layer block handles "reasoning," giving the model a second pass through that circuit should help it think harder. Like re-reading a paragraph before answering.
I wanted to map where the functional circuits actually live, whether it generalizes across architectures, and what breaks when you push it.
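A minimal sketch of the duplication idea, assuming layers are 0-indexed and ranges like "L24-27" are inclusive. The function name is made up for illustration. It only computes the order in which the original layers would run, not the actual weight surgery.

```python
def duplicated_layer_order(n_layers: int, start: int, end: int, copies: int = 1) -> list[int]:
    """Return the execution order of original layer indices when the block
    start..end (inclusive) is run `copies` extra times right after itself."""
    if not (0 <= start <= end < n_layers):
        raise ValueError("block must lie inside the layer stack")
    if copies < 0:
        raise ValueError("copies must be non-negative")
    block = list(range(start, end + 1))
    head = list(range(0, end + 1))         # everything up to and including the block
    tail = list(range(end + 1, n_layers))  # everything after the block
    return head + block * copies + tail


# 32-layer model with L24-27 given one extra pass
order = duplicated_layer_order(32, 24, 27)
print(len(order))
print(order[22:34])
```

With `copies=2` the same helper describes a "triple-stack" style variant (the block runs three times in total).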
Phase 1-3: Dense 32B (Qwen2.5-Coder-32B, 64 layers)
Mapped 5 functional circuits at different depths, including:
- L28-34 (44-53%), "structural reasoning": Different coding style. True O(1) implementations, reversed data structure polarity, underflow detection others miss.
- L36-42 (56-65%), "verification circuit": Writes the best test suites but introduces bugs in helper code. The builder and checker are literally different circuits.
Result: 10/10 vs 10/10 tie. Model was too strong to benefit. Layer duplication changed how it codes, not what it can solve. Important: a benchmark the baseline already aces has no headroom, so it can't show an improvement even if there is one.
Phase 4: Hybrid 9B (Qwen3.5-9B-abliterated, 32 layers, linear attention)
This model was weak enough to fail (4/10 baseline). Now we can measure actual capability change.
| Position | Depth | Score | Delta |
|---|---|---|---|
| L4-7 | 13-22% | 4/10 | 0 |
| L8-11 | 25-34% | 5/10 | +1 |
| L12-15 | 38-47% | 4/10 | 0 |
| L18-21 | 56-65% | 2/10 | -2 (DANGER ZONE) |
| L24-27 | 75-84% | 7/10 | +3 (WINNER) |
L24-27: 75% relative improvement (4/10 to 7/10). Three new problems solved (three_sum, word_break, longest_prefix), nothing lost from original. The "one more chance to think" hypothesis confirmed.
L18-21: actively destroys capability when doubled. These layers are attention routing — a valve that must flow at exactly the right rate.
Phase 5: Surgery Experiments on 9B
What if we get creative?
| Experiment | Score | What happened |
|---|---|---|
| Double-stack (two good circuits) | 3/10 | Circuits interfere, not compound |
| Triple-stack (3x best block) | 1/10 | Sharp cliff — barely produces Python |
| Forbidden Cut (delete danger zone + boost reasoning) | 0/10 | Total brain death |
The danger zone is load-bearing. Delete it = output dies. Duplicate it = reasoning dies. Must exist exactly once. The model is less modular than you'd hope.
The triple-stack finding is important: there's no "think harder by thinking more." One extra pass = +75%. Two extra passes = garbage. Binary threshold.
Phase 6: MoE 30B (Qwen3-30B-A3B, 48 layers, 128 experts, top-8)
The 75-85% depth rule was WRONG for MoE.
Winner: L18-21 at 38-44% depth (14/15, +1 over 13/15 baseline). The "reasoning core" in MoE models sits earlier — routing gates create implicit depth through expert selection.
Additional MoE experiments:
| Experiment | Score | Finding |
|---|---|---|
| 1 layer duplicated | 11/15 (-2) | Minimum 4 layers to help |
| 2 layers duplicated | 12/15 (-1) | Still below threshold |
| 4 layers duplicated | 14/15 (+1) | Minimum effective dose |
| 12 experts (up from 8) | 13/15 (0) | Neutral |
| 16 experts | 10/15 (-3) | Wrong experts drown signal |
| 24 experts | 8/15 (-5) | Catastrophic |
| Layer dup + wider experts | 13/15 (0) | Cancel each other out |
Dormant experts exist for a reason. Forcing them to vote is like asking everyone in a meeting to speak instead of the 8 who know the topic.
One interesting anomaly: the baseline and ALL layer-dup variants failed valid_parens (bracket matching), but EVERY expert-width variant passed it. The capability exists in dormant experts; it just never gets selected by top-8 routing. Fascinating but not actionable, since wider routing destroys harder problems.
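To make the "wider experts" experiments concrete, here is a toy sketch of top-k routing: softmax over router scores, keep the k best, renormalize. This is a common MoE scheme, but whether it matches this model's router exactly is an assumption. The function name and the logits are invented for illustration.

```python
import math


def route_top_k(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    peak = max(router_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


# Made-up router scores for 6 experts (not measured from any model)
toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]
for k in (2, 4, 6):
    weights = route_top_k(toy_logits, k)
    print(k, {i: round(w, 3) for i, w in weights.items()})
```

As k grows, the best expert's share of the output shrinks and low-scoring experts get a vote, which is one plausible reading of the "wrong experts drown signal" result.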
Phase 7: Minimum Viable Model Size
| Model | Params | Baseline | Best Variant | Delta |
|---|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 2/15 | 2/15 | 0 |
| Qwen2.5-1.5B | 1.5B | ~4/15 | ~4/15 | 0 |
| Qwen2.5-3B | 3B | 8/15 | 9/15 | +1 |
Head-to-head on 3B: Original 8/15 vs Frankenstein 9/15. Gained regex_match and median_sorted, lost group_anagrams. Speed penalty: -7.6% (127 vs 117 tok/s).
Minimum viable model: ~3B parameters. Below that, there aren't enough functional circuits to have spare reasoning capacity worth duplicating.
Phase 8: Cross-Model Layer Transplant (the big swing)
The dream: take math reasoning layers from Qwen2.5-Math-7B and graft them into Qwen2.5-7B-Instruct. Both models share identical hidden dimensions (H=3584, heads=28, kv_heads=4, intermediate=18944). Perfect dimensional compatibility.
| Variant | Code (of 15) | Math (of 5) | Verdict |
|---|---|---|---|
| Host (General-7B) | 14 | 4 | Baseline |
| Donor (Math-7B) | 3 | 4 | Baseline |
| L8-11 replace (29-39%) | 3 | 1 | Catastrophic |
| L8-11 insert (29-39%) | 7 | 4 | Half coding gone |
| L14-17 replace (50-61%) | 0 | 0 | Lobotomy |
| L14-17 insert (50-61%) | 0 | 0 | Lobotomy |
| L20-23 replace (71-82%) | 0 | 0 | Lobotomy |
| L20-23 insert (71-82%) | 0 | 0 | Lobotomy |
Cross-model transplant is a hard no. 6 of 6 variants either destroyed the model or severely degraded it. The only survivor (L8-11 insert) just added foreign layers early enough that the host routed around them — it didn't absorb math capability.
Key insight: Matching tensor dimensions is necessary but not sufficient. Layers develop model-specific internal representations during training. Swapping layers between models is like transplanting a paragraph from one book into another — same language, same page size, completely wrong context.
This confirms that frankenmerge works by duplicating a model's own circuits (letting it think twice through its own logic), not by transplanting foreign capabilities.
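A rough sketch of what "replace" and "insert" mean for the layer stack, plus the dimension check that turned out to be necessary but not sufficient. The config keys follow Hugging Face `config.json` naming and the values are the ones quoted above. Where exactly the donor block goes in "insert" mode (here: right after the host's own block) is an assumption, and both function names are hypothetical.

```python
SHAPE_KEYS = ("hidden_size", "num_attention_heads", "num_key_value_heads", "intermediate_size")


def shape_compatible(host_cfg: dict, donor_cfg: dict) -> bool:
    """True if the tensors would fit. Says nothing about whether they work together."""
    return all(host_cfg[key] == donor_cfg[key] for key in SHAPE_KEYS)


def transplant_plan(n_layers: int, start: int, end: int, mode: str) -> list[tuple[str, int]]:
    """Return (source_model, layer_index) pairs in execution order."""
    if not (0 <= start <= end < n_layers):
        raise ValueError("block must lie inside the layer stack")
    donor_block = [("donor", i) for i in range(start, end + 1)]
    if mode == "replace":
        head = [("host", i) for i in range(start)]        # host's own block is dropped
    elif mode == "insert":
        head = [("host", i) for i in range(end + 1)]      # assumption: donor goes after host block
    else:
        raise ValueError("mode must be 'replace' or 'insert'")
    tail = [("host", i) for i in range(end + 1, n_layers)]
    return head + donor_block + tail


cfg_7b = {"hidden_size": 3584, "num_attention_heads": 28,
          "num_key_value_heads": 4, "intermediate_size": 18944}
print(shape_compatible(cfg_7b, dict(cfg_7b)))
print(transplant_plan(28, 8, 11, "replace")[6:13])
```

The check passes for this pair, and the transplants still failed, so a passing shape check should be read as "it will load", not "it will work".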
The Universal Danger Zone
Replicated across ALL 5 architectures tested:
| Architecture | Layers | Danger Zone | Depth % |
|---|---|---|---|
| Dense 32B | 64 | L36-42 | 56-65% |
| Hybrid 9B | 32 | L18-21 | 56-65% |
| MoE 30B | 48 | L24-27 | 50-56% |
| Dense 3B | 36 | L18-20 | 50-56% |
| Transplant 7B | 28 | L14-17 | 50-61% |
These layers are the model's attention routing infrastructure. They're not a "circuit" you can duplicate or swap — they're the wiring between circuits. Mess with the wiring, everything downstream breaks.
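For anyone planning their own cut, a small sketch that checks whether a candidate block touches the danger zone measured for that model. The layer ranges are copied from the table above, assuming 0-indexed inclusive ranges and depth computed as layer index divided by layer count. The helper names are invented.

```python
# model -> (n_layers, zone_start, zone_end), taken from the danger-zone table
DANGER_ZONES = {
    "Dense 32B": (64, 36, 42),
    "Hybrid 9B": (32, 18, 21),
    "MoE 30B": (48, 24, 27),
    "Dense 3B": (36, 18, 20),
    "Transplant 7B": (28, 14, 17),
}


def depth_percent(n_layers: int, layer: int) -> float:
    """Depth of a layer as a percentage of the stack."""
    return 100.0 * layer / n_layers


def hits_danger_zone(model: str, start: int, end: int) -> bool:
    """True if the block start..end (inclusive) overlaps the model's danger zone."""
    _, zone_start, zone_end = DANGER_ZONES[model]
    return start <= zone_end and end >= zone_start


for model, (n, lo, hi) in DANGER_ZONES.items():
    print(f"{model}: L{lo}-{hi} = {depth_percent(n, lo):.1f}%-{depth_percent(n, hi):.1f}%")

print(hits_danger_zone("Hybrid 9B", 24, 27))  # the winning block
print(hits_danger_zone("Hybrid 9B", 18, 21))  # the block that broke it
```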
Optimal Duplication Depth by Architecture
| Type | Optimal Depth | Reasoning |
|---|---|---|
| Dense (32B) | 44-53% | Structural reasoning mid-stack |
| Hybrid linear (9B) | 75-84% | Reasoning lives late in linear attention |
| MoE (30B) | 38-44% | Expert routing pushes reasoning earlier |
| Dense (3B) | 28-36% | Smaller models reason earlier |
Practical Guide for Local Builders
- Benchmark your model first. If it already passes everything, frankenmerge can't help (Phase 3).
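- Start with 4 layers. The best depth depended on architecture: ~75% for the hybrid linear-attention model, ~40% for MoE, and roughly 30-50% for dense (see the table above).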
- One block, one copy. Every attempt to do more made things worse.
- Models under 3B: don't bother. Not enough circuit depth.
- If your variant outputs SyntaxErrors or gibberish, you hit the danger zone. Move your duplication point.
- Don't transplant between models. Duplication only. Same model, same layers, one extra copy.
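The "4 layers at a target depth" starting point from the list above can be turned into layer indices like this. It is a sketch only: the rounding rule and the function name are my own choices, and layers are assumed to be 0-indexed.

```python
def block_at_depth(n_layers: int, target_depth: float, block_size: int = 4) -> tuple[int, int]:
    """Return (start, end) of a block that begins near target_depth (0.0 to 1.0)."""
    if not 0.0 <= target_depth <= 1.0:
        raise ValueError("target_depth must be between 0 and 1")
    if not 1 <= block_size <= n_layers:
        raise ValueError("block_size must fit inside the model")
    start = round(n_layers * target_depth)
    start = max(0, min(start, n_layers - block_size))  # keep the block inside the stack
    return start, start + block_size - 1


print(block_at_depth(32, 0.75))   # hybrid 9B: 32 layers, 75% depth
print(block_at_depth(48, 0.375))  # MoE 30B: 48 layers, 37.5% depth
```

Treat the result as a first guess to benchmark, then shift the block by a few layers in each direction.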
Methodology
All benchmarks: 15 LeetCode-style problems, 3 tiers (Standard/Medium/Hard). Code generated by the model, extracted, executed against hidden test cases. PASS = code actually runs and produces correct output. No LLM-as-judge, no vibes-based scoring.
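The scoring loop boils down to something like the sketch below. It is not the actual harness: the names are placeholders, the extraction regex is a guess at a reasonable rule, and `exec` runs the generated code with no sandbox or timeout, so a real setup would want a subprocess around it.

```python
import re

TICKS = "`" * 3  # built this way so the example can live inside a fenced block
FENCE = re.compile(TICKS + r"(?:python)?[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(reply: str) -> str:
    """Take the first fenced block from a model reply, or the whole reply if none."""
    match = FENCE.search(reply)
    return match.group(1) if match else reply


def passes(reply: str, func_name: str, cases: list[tuple[tuple, object]]) -> bool:
    """PASS only if the code runs and every hidden case returns the expected value."""
    namespace: dict = {}
    try:
        exec(extract_code(reply), namespace)  # WARNING: unsandboxed, illustration only
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False  # SyntaxError, missing function, runtime crash: all count as FAIL


reply = "Sure:\n" + TICKS + "python\ndef add(a, b):\n    return a + b\n" + TICKS
print(passes(reply, "add", [((1, 2), 3), ((0, 0), 0)]))
```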
Speed penalty was in the 8-9% range for both measured variants (7 extra layers on the 64-layer model = -9%, 4 extra on the 36-layer model = -7.6%).
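As a sanity check on those numbers, a back-of-envelope estimate, assuming tokens/sec scales with 1 / layer count and ignoring embeddings, the LM head, and memory-bandwidth effects. The function name is invented.

```python
def expected_slowdown(n_layers: int, extra_layers: int) -> float:
    """Fractional throughput loss if per-token cost is proportional to layer count."""
    if n_layers <= 0 or extra_layers < 0:
        raise ValueError("need a positive layer count and non-negative extra layers")
    return 1.0 - n_layers / (n_layers + extra_layers)


for n_layers, extra in ((64, 7), (36, 4)):
    print(f"{n_layers}+{extra} layers: {expected_slowdown(n_layers, extra):.1%} estimated")
```

Both estimates land near 10%, a little above the measured penalties, which would fit some per-token cost that does not grow with layer count.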
Full lab notebook and all scripts available on request.
What's Next
- Block size sweep: is 4 layers optimal or just the first size that works?
- LoRA on duplicated layers: can fine-tuning sharpen the extra pass?
- Repeat runs (3x minimum) for variance analysis
- Test on Llama, Mistral, Phi architectures
Drew Smith — Rocktalk Research Letting the Rocks Cry Out