OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
arXiv cs.AI / 4/1/2026
Key Points
- The paper introduces OptiMer, a continual pre-training (CPT) approach that removes the need to fix data-mixture ratios before training: it trains one model per dataset and optimizes composition weights post hoc.
- OptiMer extracts a "distribution vector" from each dataset-specific CPT model to represent the parameter shift induced by that data, and uses Bayesian optimization to find optimal weights for combining these vectors (minimal sketches follow this list).
- Experiments with Gemma 3 (27B) on multiple languages (Japanese, Chinese) and domains (Math, Code) show OptiMer improves performance over data mixing and model averaging baselines.
- The method reduces search cost by 15–35× and yields interpretable weights that can also serve as effective mixture ratios for a retrained data-mixture CPT run.
- The same pool of distribution vectors can be re-optimized for different objectives without retraining, enabling target-tailored models on demand.
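As described in the key points, each distribution vector is the parameter shift a dataset-specific CPT checkpoint induces relative to the base model, and the merged model applies a weighted sum of those shifts. Here is a minimal PyTorch sketch under that reading; the function names, toy tensors, and weight values are illustrative, not from the paper:

```python
import torch

def distribution_vector(base_state, cpt_state):
    """Parameter shift induced by one dataset's CPT run: v = theta_cpt - theta_base."""
    return {k: cpt_state[k] - base_state[k] for k in base_state}

def merge(base_state, vectors, weights):
    """Add a weighted sum of distribution vectors back onto the base parameters."""
    return {
        k: theta + sum(w * v[k] for w, v in zip(weights, vectors))
        for k, theta in base_state.items()
    }

# Toy demo with random tensors standing in for model state dicts; in practice
# these would be the base checkpoint and per-dataset CPT checkpoints
# (e.g., Japanese, Chinese, Math, and Code runs on Gemma 3).
base = {"layer.weight": torch.randn(4, 4)}
cpt_runs = [
    {"layer.weight": base["layer.weight"] + 0.1 * torch.randn(4, 4)}
    for _ in range(4)
]
vecs = [distribution_vector(base, s) for s in cpt_runs]
merged = merge(base, vecs, weights=[0.4, 0.2, 0.3, 0.1])
```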
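The composition weights themselves come from Bayesian optimization against a target evaluation. The paper does not name a library, so this sketch continues the one above using Optuna (whose default TPE sampler is a Bayesian-style optimizer) as a stand-in; `evaluate` is a hypothetical placeholder for whatever benchmark score is being maximized:

```python
import optuna

def evaluate(merged_state):
    # Placeholder score so the loop runs end to end; a real run would load the
    # merged weights into the model and score the target benchmark instead.
    return -sum(float(t.abs().mean()) for t in merged_state.values())

def objective(trial):
    # One weight per distribution vector, searched over a bounded range.
    weights = [trial.suggest_float(f"w{i}", 0.0, 1.0) for i in range(len(vecs))]
    return evaluate(merge(base, vecs, weights))

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=50)
best_weights = [study.best_params[f"w{i}"] for i in range(len(vecs))]
```

Each trial costs one merge plus one evaluation rather than a fresh pre-training run, which is consistent with the reported 15–35× search-cost reduction; swapping the benchmark behind `evaluate` re-targets the same vector pool, matching the claim that the vectors can be re-optimized for new objectives without retraining.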