Disclaimer: I work at NuMind (we train LLMs for structured + content extraction).
If you've been working with Qwen3.5 (and other recently released models), you probably know it includes Multi-Token Prediction (MTP) modules. When used with vLLM (qwen3_next_mtp), these can significantly speed up inference, especially on predictable workloads: the more predictable the output, the higher the acceptance rate of the draft tokens.
However:
- Hugging Face Transformers doesn't yet support MTP, either for inference or for training
- As a result, if you fine-tune with Trainer, the MTP weights are never loaded, trained, or saved
- Consequence: vLLM crashes when you try to use speculative decoding (with --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 4}') because the weights are missing
Quick workaround
Not perfect, but it works: you can simply copy the MTP weights from the base model into your fine-tuned model.
* The MTP heads remain untrained
* But in practice, it’s still useful
The code is simply something like:

```python
from pathlib import Path

from safetensors import safe_open
from safetensors.torch import save_file

path_source_model = Path("path/to/base-model")  # local snapshot of the base model
out_filepath = "mtp_weights.safetensors"

# Collect every tensor whose name looks like an MTP / next-token-prediction module
mtp_weights = {}
for filepath in path_source_model.glob("*.safetensors"):
    with safe_open(filepath, framework="pt", device="cpu") as f:
        for key in f.keys():
            if "mtp" in key.lower() or "nextn" in key.lower():
                mtp_weights[key] = f.get_tensor(key)

save_file(mtp_weights, out_filepath)
```

and then updating the model.safetensors.index.json accordingly.
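The index update amounts to pointing each copied tensor name at the new shard file. A minimal sketch (the helper name, paths, and single-shard layout are my assumptions, not part of the tool):

```python
import json
from pathlib import Path


def update_index(index_path: Path, mtp_keys: list, shard_name: str) -> None:
    """Map each transplanted MTP tensor to its new shard file in
    model.safetensors.index.json so loaders (vLLM, Transformers) can find it."""
    index = json.loads(index_path.read_text())
    for key in mtp_keys:
        index["weight_map"][key] = shard_name
    index_path.write_text(json.dumps(index, indent=2))
```

Called with the keys collected above, e.g. `update_index(model_dir / "model.safetensors.index.json", list(mtp_weights), "mtp_weights.safetensors")`.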
Using my tool, it is simply a matter of running

python3 main.py -s Qwen/Qwen3.5-0.8B -t numind/NuExtract-alpha

to merge the original MTP modules from Qwen3.5 into the fine-tuned model. This should also work with merged LoRA.
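A quick way to sanity-check the result is to scan the merged checkpoint's tensor names with the same substring filter used for the transplant (the sample key names below are made up for illustration):

```python
def find_mtp_keys(keys):
    """Return tensor names that look like MTP / next-token-prediction modules."""
    return [k for k in keys if "mtp" in k.lower() or "nextn" in k.lower()]


# Illustrative key names; real ones come from each shard's f.keys()
sample = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.mtp.layers.0.mlp.gate_proj.weight",
    "model.nextn.predict_layer.weight",
]
print(find_mtp_keys(sample))
```

To check a real checkpoint, iterate over its *.safetensors shards with safe_open and pass each file's keys() through this filter; a merged model should report a non-empty list.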
In our internal tests:
* Acceptance rates of up to ~0.9 with up to ~4 speculative tokens
* Results are highly workload-dependent, however
For our larger models and future open-weight models, we will include the MTP heads during training to improve efficiency and acceptance rates. We have patched Transformers to support this, and hopefully it will become available to everyone in the future.
Tool
I made a small CLI to do this automatically:
https://github.com/SorenDreano/transplant_mtp (MIT)
Tested on Qwen3.5 models.
Context (what we’re building)
We have released open-weight models for document understanding:
NuExtract 2.0: structured extraction into JSON templates
https://huggingface.co/numind/NuExtract-2.0-8B
NuExtract is a model that takes both a JSON template input like

```json
{
  "Last name": "verbatim-string",
  "First names": ["verbatim-string"],
  "Document number": "verbatim-string",
  "Date of birth": "date-time",
  "Gender": ["Male", "Female", "Other"],
  "Expiration date": "date-time",
  "Country ISO code": "string"
}
```

and a document (usually an image or scan), and fills the template with the correct information without hallucination.
NuMarkdown: convert documents (images, PDFs, text) into (you guessed it) Markdown
https://huggingface.co/numind/NuMarkdown-8B-Thinking
We are soon going to release a new open-weight model that does BOTH structured (JSON template) AND content (Markdown) extraction.
We also have a SaaS offering and can deploy on-premises: https://nuextract.ai
Curious if others have tried different approaches to keep MTP during fine-tuning or if anyone has patched Transformers to support it properly.



