Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

arXiv cs.CL · April 21, 2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper extends the MLX-LM framework with Universal Assisted Generation (UAG) to enable speculative decoding across mismatched tokenizers on Apple Silicon.
  • Using Bielik 11B-Instruct (Mistral-based) as the target and three draft models (Bielik 1.5B, Qwen2.5-1.5B, and Llama 3.2-1B), the study evaluates draft lengths k={2,4,6} on three Polish datasets (Wikipedia, pl_alpaca, synthetic).
  • Context-aware token translation improves acceptance rates across configurations, but the Polish-specialized Bielik 1.5B draft shows lower acceptance than general-purpose Qwen2.5 and Llama 3.2 drafts.
  • Throughput gains on Apple Silicon are content-dependent: up to ~1.7x speedup for structured text but degraded performance for varied instructions, and the theoretically predicted amortization of verification cost fails because both draft and target models are memory-bandwidth bound.
  • The authors provide a hardware-aware speedup formula and conditions for when cross-family speculative decoding is likely to work effectively on unified memory systems.
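The paper's exact hardware-aware formula is not reproduced in this summary, but the standard cost model it refines can be sketched as follows. With per-token acceptance probability alpha and draft length k, a draft-verify cycle yields (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation; dividing by the cycle cost (k sequential draft steps plus one batched target verification) gives the speedup. The c_ratio parameter below is an illustrative name, not from the paper, and the model assumes independent acceptances and a single-pass batched verification:

```python
def expected_accepted(alpha: float, k: int) -> float:
    # Expected tokens per draft-verify cycle: the geometric series over the
    # longest accepted prefix, plus the one token the target always emits.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c_ratio: float) -> float:
    # c_ratio = cost of one draft step / cost of one target step.
    # On high-bandwidth GPUs c_ratio ~ 0, so drafting is nearly free.
    # On unified memory both models are bandwidth-bound, c_ratio is
    # substantial, and the predicted speedup erodes accordingly.
    return expected_accepted(alpha, k) / (k * c_ratio + 1)
```

For example, with alpha = 0.8 and k = 4, a "free" draft model (c_ratio = 0) predicts about 3.4x, while a bandwidth-bound draft at c_ratio = 0.3 drops the prediction to roughly 1.5x, consistent with the content-dependent gains the paper reports.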

Abstract

Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. We evaluate Bielik 11B-Instruct (Mistral-based) as the target model, paired with three draft models: Bielik 1.5B (Qwen-based with custom tokenizer), Qwen2.5-1.5B, and Llama 3.2-1B. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) use draft lengths k in {2, 4, 6} to compare naive and context-aware token translation. Results show: (1) context-aware translation consistently improves acceptance rates across all configurations; (2) the Polish-specialized Bielik 1.5B achieves lower acceptance than general-purpose Qwen2.5 and Llama 3.2 drafters; (3) throughput on Apple Silicon is content-dependent, reaching 1.7x speedup for structured text but failing for varied instructions; and (4) verification cost on unified memory does not amortize as theory predicts because both models are memory-bandwidth bound, making sequential drafting expensive relative to batched verification. We propose a hardware-aware speedup formula and characterize conditions for cross-family speculative decoding on Apple Silicon. This is the first systematic evaluation of cross-family speculative decoding for Polish LLMs and the first empirical study of UAG-based decoding on unified memory architectures.
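The core UAG mechanism the abstract describes, translating draft tokens across mismatched vocabularies by round-tripping through text, can be sketched with toy character-level stand-ins. This is not the MLX-LM or paper API; the function and stub names are hypothetical, and real tokenizers make the re-tokenization step (and the context-aware translation the paper evaluates) considerably subtler:

```python
def uag_step(draft_generate, draft_detok, target_tok, target_verify, prefix, k):
    # 1) Draft model proposes k tokens in ITS OWN vocabulary.
    draft_ids = draft_generate(prefix, k)
    # 2) Translate via text: detokenize the draft ids, then re-tokenize
    #    with the target tokenizer (the UAG trick for mismatched vocabularies).
    candidate_ids = target_tok(draft_detok(draft_ids))
    # 3) Target verifies the translated candidates in one batched pass,
    #    keeping the longest accepted prefix plus one corrected token.
    return target_verify(prefix, candidate_ids)

# Toy stand-ins (characters as tokens, fixed continuations as "models"):
TARGET_CONT = "speculative decoding"   # what the target would generate
DRAFT_CONT = "speculative sampling"    # what the draft proposes

def draft_generate(prefix, k):
    return list(DRAFT_CONT[len(prefix):len(prefix) + k])

def draft_detok(ids):
    return "".join(ids)

def target_tok(text):
    return list(text)

def target_verify(prefix, candidate_ids):
    accepted, pos = [], len(prefix)
    for tok in candidate_ids:
        if pos < len(TARGET_CONT) and TARGET_CONT[pos] == tok:
            accepted.append(tok)
            pos += 1
        else:
            break
    if pos < len(TARGET_CONT):          # target always adds one token of its own
        accepted.append(TARGET_CONT[pos])
    return accepted
```

In this toy, `uag_step(..., "specul", 6)` accepts the drafted "ative " (the continuations agree there) and appends the target's own "d", illustrating why acceptance rates, not raw draft quality, drive the speedups measured in the paper.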