# Release v5.6.0

## New model additions

### OpenAI Privacy Filter
OpenAI Privacy Filter is a bidirectional token-classification model for detecting and masking personally identifiable information (PII) in text. It is intended for high-throughput data sanitization workflows where teams need a fast, context-aware, and tunable model they can run on-premises. In a single forward pass, the model predicts a probability distribution over 8 privacy-related output categories for each input token, then decodes coherent spans with a constrained Viterbi procedure.
Links: Documentation
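The single-pass labeling plus constrained span decoding described above can be sketched with a small Viterbi decoder. This is a generic illustration under assumed names: the label set, scores, and transition table below are invented for the example, not the model's actual 8 categories or constraints.

```python
def viterbi_decode(scores, labels, allowed):
    """scores: per-token dict of label -> log-score; allowed: set of
    (prev, cur) label transitions permitted between adjacent tokens.
    Returns the highest-scoring label sequence obeying the constraints."""
    NEG = float("-inf")
    # trellis[i][lab] = (best score for a path ending in lab at token i, backpointer)
    trellis = [{lab: (scores[0].get(lab, NEG), None) for lab in labels}]
    for i in range(1, len(scores)):
        row = {}
        for cur in labels:
            cand = [
                (trellis[i - 1][prev][0] + scores[i].get(cur, NEG), prev)
                for prev in labels
                if (prev, cur) in allowed
            ]
            row[cur] = max(cand) if cand else (NEG, None)
        trellis.append(row)
    # Backtrack from the best final label.
    lab = max(trellis[-1], key=lambda l: trellis[-1][l][0])
    path = [lab]
    for i in range(len(scores) - 1, 0, -1):
        lab = trellis[i][lab][1]
        path.append(lab)
    return path[::-1]
```

With BIO-style labels, forbidding the `("O", "I")` transition is what keeps decoded spans coherent: an inside tag can never appear without an opening begin tag, so the decoder is forced to open a span properly even when the raw per-token scores would prefer otherwise.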
### QianfanOCR
Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by Baidu that performs direct image-to-text conversion without traditional multi-stage OCR pipelines. It supports a broad range of prompt-driven tasks, including structured document parsing, table extraction, chart understanding, document question answering, and key information extraction, all within one unified model. The model features a unique "Layout-as-Thought" capability that generates structured layout representations before producing final outputs, making it particularly effective for complex documents with mixed element types.
Links: Documentation | Paper
### SAM3-LiteText
SAM3-LiteText is a lightweight variant of SAM3 that replaces the heavy SAM3 text encoder (353M parameters) with a compact MobileCLIP-based text encoder optimized through knowledge distillation, while keeping the SAM3 ViT-H image encoder intact. This reduces text encoder parameters by up to 88% while maintaining segmentation performance comparable to the original model. The model enables efficient vision-language segmentation by addressing the redundancy found in text prompting for segmentation tasks.
Links: Documentation | Paper
- Add SAM3-LiteText (#44320) by @NielsRogge in [#44320]
### SLANet
SLANet and SLANet_plus are lightweight models designed for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. They improve accuracy and inference speed by adopting a CPU-friendly lightweight backbone network (PP-LCNet), a high-/low-level feature fusion module (CSP-PAN), and a feature decoding module (SLA Head) that aligns structural and positional information. SLANet was developed by the Baidu PaddlePaddle Vision Team as part of its table structure recognition solutions.
Links: Documentation
- [Model] Add SLANet Model Support (#45532) by @zhang-prog in #45532
## Breaking changes
The internal `rotary_fn` is no longer registered as a hidden kernel function, so any code referencing `self.rotary_fn(...)` within an Attention module will break and must be updated to call the function directly instead.
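A minimal before/after sketch of the migration. The class and function names below are toy stand-ins for illustration, not the actual transformers internals:

```python
# Illustrative stand-in: `apply_rotary_fn` here is a toy function, not the
# real rotary-embedding implementation.
def apply_rotary_fn(q, cos, sin):
    # toy rotary application: combine query with cos/sin components
    return q * cos + q * sin

class Attention:
    def forward(self, q, cos, sin):
        # Before v5.6.0, the function was reachable as a registered attribute:
        #     q = self.rotary_fn(q, cos, sin)   # now raises AttributeError
        # After v5.6.0, call the module-level function directly:
        return apply_rotary_fn(q, cos, sin)
```

The fix is purely mechanical: replace the attribute lookup with a direct call to the function at module scope.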
## Serve
The `transformers serve` command received several enhancements, including a new `/v1/completions` endpoint for legacy text completion, multimodal support for audio and video inputs, improved tool calling via `parse_response`, proper forwarding of `tool_calls`/`tool_call_id` fields, a 400 error on model mismatch when the server is pinned to a specific model, and fixes for the response API. Documentation was also updated to cover new serving options such as `--compile` and `--model-timeout`.
- Add `/v1/completions` endpoint (OpenAI legacy completions API) to `transformers serve` (#44558) by @rain-1 in [#44558]
- Updated the image cache for Paddle models according to the latest API (#45562) by @zhang-prog in [#45562]
- Raise 400 on model mismatch when `transformers serve` is pinned (#45443) by @qgallouedec in [#45443]
- [serve] Update tool call to switch to `parse_response` (#45485) by @SunMarc in [#45485]
- Fix response api support (#45463) by @SunMarc in [#45463]
- [serve] Forward `tool_calls`/`tool_call_id` in processor inputs (#45418) by @qgallouedec in [#45418]
- refactor(qa): extend extras so ty can run on server modules (#45456) by @tarekziade in [#45456]
- Multimodal serve support (#45220) by @SunMarc in [#45220]
- [docs] transformers serve (#45174) by @stevhliu in [#45174]
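For reference, the legacy completions endpoint accepts OpenAI-style request bodies. The sketch below only builds such a body; the server address and model id in the comment are illustrative assumptions, not values from the release notes.

```python
import json

# Hypothetical model id -- substitute whatever model your server is running.
payload = {
    "model": "my-org/my-model",
    "prompt": "Once upon a time",
    "max_tokens": 32,
    "temperature": 0.7,
}
body = json.dumps(payload)

# Assuming a local server, the request could then be sent with e.g.:
#   curl http://localhost:8000/v1/completions \
#        -H "Content-Type: application/json" -d "$BODY"
```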
## Vision
Several vision-related bug fixes were applied in this release, including correcting Qwen2.5-VL temporal RoPE scaling for still images, fixing missing/mismatched image processor backends for Emu3 and BLIP, resolving modular image processor class duplication, and preventing accelerate from incorrectly splitting vision encoders in PeVideo/PeAudioVideo models. Image loading performance was also improved by leveraging torchvision's native `decode_image` in the torchvision backend, yielding up to ~17% speedup over PIL-based loading.
- Revert "Fix: modular image processors (#45492)" (#45531) by @tarekziade in [#45531]
- Fix: modular image processors (#45492) by @zucchini-nlp in [#45492]
- fix: prevent accelerate from splitting vision encoder by setting no… (#43047) by @ in [#43047]
- Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by @Kash6 in [#45330]
- Use torchvision `decode_image` to load images in the torchvision backend (#45195) by @yonigozlan in [#45195]
- Fix missing image processors backends (#45165) by @zucchini-nlp in [#45165]
## Parallelization
Fixed several bugs affecting distributed training, including silently wrong results or NaN loss with Expert Parallelism, NaN weights on non-rank-0 FSDP processes, and a resize failure in PP-DocLayoutV3; additionally added support for loading adapters with Tensor Parallelism, added MoE to the Gemma4 TP plan, and published documentation for TP training.
- Fix EP: RouterParallel shape, tp_plan property, grouped_mm sentinels (#45473) by @AmineDiro in [#45473]
- Fix NaN weights on non-rank-0 FSDP processes (#45050) by @albertvillanova in [#45050]
- Load adapter with TP (#45155) by @michaelbenayoun in [#45155]
- [docs] tp training (#44613) by @stevhliu in [#44613]
- Fix resize failure caused by zero-sized masks in PP-DocLayoutV3 (#45281) by @zhang-prog in [#45281]
- Add MoE to Gemma4 TP plan (#45219) by @sywangyi in [#45219]
## Tokenization
Fixed a docstring typo in streamer classes, resolved a Kimi-K2.5 tokenizer regression and a `_patch_mistral_regex` AttributeError, and patched a streaming generation crash for `Qwen3VLProcessor` caused by incorrect `_tokenizer` attribute access. Additional housekeeping included moving the GPT-SW3 instruct tokenizer to an internal testing repo and fixing a global state leak in the tokenizer registry during tests.
- [Doc] Fix 'tokenized' -> 'tokenizer' typo in streamer docstrings (#45508) by @avasis-ai in [#45508]
- Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError (#45359) by @ArthurZucker in [#45359]
- fix(serving): resolve rust tokenizer from ProcessorMixin in streaming generation (#45368) by @sharziki in [#45368]
- [`Tokenizers`] Move gpt sw3 tokenizer out (#45404) by @vasqu in [#45404]
- fix: leak in tokenizer registry for `test_processors` (#45318) by @tarekziade in [#45318]
## Cache
Cache handling was improved for Gemma4 and Gemma3n models by dissociating KV state sharing from the Cache class, ensuring KV states are always shared regardless of whether a Cache is used. Additionally, the image cache for Paddle models was updated to align with the latest API.
- Align gemma3n cache sharing to gemma4 (#45489) by @Cyrilvallez in [#45489]
- remove cache file from tree (#45392) by @tarekziade in [#45392]
- [gemma4] Dissociate kv states sharing from the Cache (#45312) by @Cyrilvallez in [#45312]
## Audio
Audio models gained vLLM compatibility through targeted fixes across several model implementations. Reliability improvements were also made, including exponential back-off retries for audio file downloads, a crash fix in the text-to-speech pipeline when generation configs contain None values, and corrected test failures for Kyutai Speech-To-Text.
- feat[vLLM × v5]: Add vLLM compatibility for audio models (#45326) by @harshaljanjani in [#45326]
- http retries on audio file downloads (#45126) by @tarekziade in [#45126]
- fix(testing): Fix Kyutai Speech-To-Text and LongCatFlash test failures on main CI (#44695) by @harshaljanjani in [#44695]
- Fix `text-to-speech` pipeline crash when generation config contains `None` values (#45107) by @jiqing-feng in [#45107]
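The retry behavior added for audio downloads follows the standard exponential back-off pattern. The sketch below is a generic illustration of that pattern, not the actual code from #45126; the function name and retry parameters are assumptions.

```python
import time

def fetch_with_backoff(fetch, retries=4, base_delay=0.5):
    """Call `fetch()` until it succeeds, sleeping base_delay * 2**attempt
    seconds between failures; re-raise after the final attempt."""
    for attempt in range(retries):
        try:
            return fetch()
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Doubling the delay on each attempt keeps retries cheap for transient network hiccups while backing off quickly enough to avoid hammering a struggling host.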
## Bugfixes and improvements
- [`Privacy Filter`] Add model (#45580) by @vasqu in [#45580]
- Add ForSequenceClassification heads for the OLMo family (#45551) by @earino in [#45551]
- Add IndexCache support for GLM5 DSA (#45424) by @louzongzhi in [#45424]
- Fix redundant logic in video processing SmolVLM (#45272) by @yonigozlan in [#45272]
- Fix typos (#45574) by @vasqu in [#45574]
- [Model] Add SLANet Model Support (#45532) by @zhang-prog in [#45532]
- refactor(Dots1): drop Dots1MoE override to `pass` (inherits from DSV3 MoE) (#45572) by @casinca in [#45572]
- perf: avoid recomputing rotary_emb for each layer in some Google and ModernBERT models (#45555) by @casinca in [#45555]
- Gemma4 training with text-only samples (#45454) by @zucchini-nlp in [#45454]
- [nemotron_h] Add support for MLP mixers (#44763) by @xenova in [#44763]
- add expert parallelism for gemma-4-26B-A4B-it (#45279) by @sywangyi in [#45279]
- Add full GGUF loading support for GPT-OSS (fixes #43366, supersedes #43757) (#45506) by @sirzechs66 in [#45506]
- Update Gemma4 weight conversion script (#45328) by @RyanMullins in [#45328]
- Move some conversion mappings to PrefixChange (#45567) by @Cyrilvallez in [#45567]
- fix table update versions (#45544) by @tarekziade in [#45544]
- Add disable_mmap kwarg to from_pretrained with hf-mount auto-detection (#45547) by @rtrompier in [#45547]
- fix(DSV3): parity between native `DeepseekV3MoE` and remote official implementation (#45441) by @casinca in [#45441]
- [modular] Fix modular logic broken in #45045 (#45539) by @Cyrilvallez in [#45539]
- Fix: propagate quantization_config to text sub-config for composite models in AutoModelForCausalLM (#45494) by @lvliang-intel in [#45494]
- T5Gemma2: fix `prepare_decoder_input_ids_from_labels` (#45516) by @Tokarak in [#45516]
- [Trainer] Add ddp_static_graph option (#45519) by @KeitaW in [#45519]
- Add dtype config options for Four Over Six (#45367) by @jackcook in [#45367]
- [Sam3LiteText] Remove unnecessary modules/configs (#45535) by @yonigozlan in [#45535]
- Fix conditional check for float formatting (#44425) by @qgallouedec in [#44425]
- Fix AMD CI: rebuild torchvision with libjpeg + refresh expectations (#45533) by @Abdennacer-Badaoui in [#45533]
- Reapply modular to examples (#45527) by @Cyrilvallez in [#45527]
- qa: re-run modular converter when the script itself is modified (#45528) by @tarekziade in [#45528]
- [GGUF] Reduce peak RAM usage by casting dequantized tensors early during load (#45386) by @UsamaKenway in [#45386]
- Fix CSM `TextToAudioPipeline` missing `<bos>` token (#45525) by @jiqing-feng in [#45525]
- [`Conversion Mapping`] Small fixups (#45483) by @vasqu in [#45483]
- fix: return empty tuple from import_protobuf_decode_error when protobuf is unavailable (#45486) by @jw9603 in [#45486]
- throw error when conversion required (#45078) by @itazap in [#45078]
- chore: bump doc-builder SHA for PR upload workflow (#45450) by @rtrompier in [#45450]
- xpu output align with cuda in test case (#45526) by @sywangyi in [#45526]
- chore(qa): split out mlinter (#45475) by @tarekziade in [#45475]
- [loading] Clean way to add/remove full parts in checkpoint names (#45448) by @Cyrilvallez in [#45448]
- Fix Zamba2MambaMixer ignoring use_mamba_kernels=False (#44853) by @sergiopaniego in [#44853]
- revert sha commit pointing to main for transformers_amd_ci_ workflows (#45495) by @paulinebm in [#45495]
- Fix ZeRO-3 from_pretrained: load registered buffers in _load_state_dict_into_zero3_model (#45402) by @saslifat-gif in [#45402]
- Remove redundant condition checks in `get_image_size` method (#45461) by @JiauZhang in [#45461]
- Add check-auto in repo-consistency and fix sorting (#45481) by @zucchini-nlp in [#45481]
- Fix typos in src/transformers/utils/output_capturing.py (#45269) by @ryota-komatsu in [#45269]
- typing: rule 15 - checks for tie_word_embeddings presence (#44988) by @tarekziade in [#44988]
- [CB] Fix capture of max_seqlen (#45323) by @remi-or in [#45323]
- Minor update (#45484) by @ydshieh in [#45484]
- Add Neuron to auto-compile hardware list (#44757) by @dacorvo in [#44757]
- Allow loading Qwen Thinker 'base' models without generative head (#45457) by @tomaarsen in [#45457]
- [`fix`] Always early return for non-Mistral models in `_patch_mistral_regex` (#45444) by @tomaarsen in [#45444]
- Fix spurious position_ids warnings for at least 40 architectures (#45437) by @tomaarsen in [#45437]
- [`fix`] Make Qwen2_5OmniProcessor warning a lot less noisy via warning_once (
