[Tool] Quick hack to recover Qwen3.5 MTP after fine-tuning for faster inference speed (Transformers)

Reddit r/LocalLLaMA / 4/8/2026


Key Points

  • Qwen3.5 includes Multi-Token Prediction (MTP) support that can accelerate inference when used with vLLM (e.g., qwen3_next_mtp), but Hugging Face Transformers currently lacks MTP support for training/inference.

Disclaimer: I work at NuMind (we train LLMs for structured + content extraction).

If you've been working with Qwen3.5 (and other recently released models), you probably know it includes Multi-Token Prediction (MTP) modules. When used with vLLM (qwen3_next_mtp), this can significantly speed up inference, especially on predictable workloads (the more "predictable" the better since the draft tokens will have a higher acceptance rate).

However:

- Hugging Face Transformers doesn’t support MTP yet, for either inference or training

- Thus, if you fine-tune with Trainer, MTP weights are never loaded, trained, or saved

- Result: vLLM crashes when you try to use speculative decoding (using --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":4}') because the weights are missing
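One quick way to confirm a fine-tuned checkpoint is missing these tensors is to scan its `model.safetensors.index.json` for MTP/NextN key names. This is a hypothetical helper (not part of the tool below) using the same "mtp"/"nextn" naming heuristic; tensor names may differ between model families:

```python
import json
from pathlib import Path

def has_mtp_weights(model_dir: str) -> bool:
    """Heuristically detect MTP tensors in a sharded safetensors checkpoint."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    # MTP / NextN tensors typically carry "mtp" or "nextn" in their names.
    return any("mtp" in k.lower() or "nextn" in k.lower()
               for k in index["weight_map"])
```

If this returns False on a Trainer-saved checkpoint, vLLM's `qwen3_next_mtp` speculative decoding will fail to find the draft-head weights.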

Quick workaround

Not perfect, but it works: you can just copy the MTP weights from the base model into your fine-tuned model.

* The MTP heads remain untrained

* But in practice, it’s still useful

The code is simply something like:

```python
from safetensors import safe_open
from safetensors.torch import save_file

mtp_weights = {}
for filepath in path_source_model.glob("*.safetensors"):
    with safe_open(filepath, framework="pt", device="cpu") as f:
        for key in f.keys():
            # MTP / NextN tensors carry "mtp" or "nextn" in their names
            if "mtp" in key.lower() or "nextn" in key.lower():
                mtp_weights[key] = f.get_tensor(key)
save_file(mtp_weights, out_filepath)
```

and then updating the model.safetensors.index.json
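That index update can be scripted too. A minimal sketch (function name and shard filename are placeholders) that maps each copied MTP key to the shard file it now lives in, so loaders like vLLM can find the weights:

```python
import json
from pathlib import Path

def add_mtp_to_index(index_path: Path, mtp_keys, shard_name: str) -> None:
    """Register transplanted MTP tensors in model.safetensors.index.json."""
    index = json.loads(index_path.read_text())
    for key in mtp_keys:
        # weight_map tells the loader which shard file holds each tensor
        index["weight_map"][key] = shard_name
    index_path.write_text(json.dumps(index, indent=2))
```

You would call it with the keys collected in the transplant loop, e.g. `add_mtp_to_index(model_dir / "model.safetensors.index.json", mtp_weights.keys(), "mtp.safetensors")`.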

Using my tool, it is simply a matter of doing

```shell
python3 main.py -s Qwen/Qwen3.5-0.8B -t numind/NuExtract-alpha
```

to merge the original MTP modules from Qwen3.5 into the fine-tuned model. This should also work with merged LoRA.

In our internal tests:

* Acceptance rate up to ~0.9 with up to ~4 speculative tokens

* Highly workload-dependent, however
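To put those numbers in perspective: under the usual i.i.d. acceptance model from the speculative-decoding literature (an idealization; real acceptances are correlated), a per-token acceptance rate a with k draft tokens yields (1 − a^(k+1)) / (1 − a) expected output tokens per verification step:

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected output tokens per decode step, k drafts accepted i.i.d. at rate a."""
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)

# With the post's figures (a ≈ 0.9, k = 4) this is about 4.1 tokens per step.
print(round(expected_tokens_per_step(0.9, 4), 2))
```

So an acceptance rate near 0.9 with 4 speculative tokens roughly quadruples the tokens emitted per forward pass of the full model, which is where the inference speedup comes from.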

For our larger models and future open-weight releases, however, we will include the MTP heads during training to improve efficiency and acceptance rate. We have patched Transformers to support this, and hopefully it will become available for everyone in the future.

Tool

I made a small CLI to do this automatically:

https://github.com/SorenDreano/transplant_mtp (MIT)

Tested on Qwen3.5 models.

Context (what we’re building)

We have released open-weight models for document understanding:

NuExtract 2.0: structured extraction into JSON templates

https://huggingface.co/numind/NuExtract-2.0-8B

NuExtract is a model that takes both a JSON template input like

```json
{
  "Last name": "verbatim-string",
  "First names": ["verbatim-string"],
  "Document number": "verbatim-string",
  "Date of birth": "date-time",
  "Gender": ["Male", "Female", "Other"],
  "Expiration date": "date-time",
  "Country ISO code": "string"
}
```

and a document (usually an image or scan) and fills the template with correct information without hallucination.
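As a toy illustration of that contract (a structural sanity check only, not NuMind's API or validation logic), a filled output should mirror the template's keys, with lists staying lists:

```python
template = {
    "Last name": "verbatim-string",
    "First names": ["verbatim-string"],
    "Document number": "verbatim-string",
}

def matches_template(out: dict, tpl: dict) -> bool:
    """Check the output carries exactly the template's keys and list shapes."""
    if out.keys() != tpl.keys():
        return False
    return all(isinstance(out[k], list) == isinstance(tpl[k], list) for k in tpl)

filled = {"Last name": "Doe", "First names": ["Jane"], "Document number": "X123"}
print(matches_template(filled, template))  # True
```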

NuMarkdown: convert documents (images, PDFs, text) into (you guessed it) Markdown

https://huggingface.co/numind/NuMarkdown-8B-Thinking

We are soon going to release a new open-weight model that does BOTH structured (JSON template) AND content (Markdown) extraction

We also have a SaaS offering and can deploy on-premise: https://nuextract.ai

Curious if others have tried different approaches to keep MTP during fine-tuning or if anyone has patched Transformers to support it properly.

submitted by /u/Gailenstorm