A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
arXiv cs.CL / 4/30/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper investigates how Continual Pre-Training (CPT) hyperparameters—specifically the Additional Language Mixture Ratio (ALMR) of extra language/domain data—affect downstream performance.
- It uses Llama-3 8B as a smaller experimental proxy to study the relationship between the ALMR and the Learning Rate (LR) and to identify an optimal setting for the two (see the sketch after this list).
- With the tuned hyperparameters and subsequent fine-tuning, the authors report improved Chinese capability as well as gains in specific domains such as math, coding, and emotional intelligence.
- The resulting tuned Llama-3 70B model is deployed in a real-world chat system and shows satisfactory real-life performance, carrying the proxy-scale findings through to deployment at the larger scale.
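To make the ALMR/LR search concrete, the following is a minimal Python sketch of how one might mix additional-language data into a continual pre-training stream at a given ratio and grid-search (ALMR, LR) pairs on a smaller proxy model. The corpus names, grid values, and the `run_proxy_trial` stub are illustrative assumptions, not the paper's actual data, grid, or results.

```python
import itertools
import random
from dataclasses import dataclass

# Hypothetical stand-ins: `base_corpus` is the original pre-training data,
# `extra_corpus` is the additional Chinese/domain data to be mixed in.
base_corpus = [f"base_doc_{i}" for i in range(10_000)]
extra_corpus = [f"extra_doc_{i}" for i in range(10_000)]


def mix_corpora(base, extra, almr, n_samples, seed=0):
    """Sample a CPT data batch where `almr` is the fraction drawn from the extra corpus."""
    rng = random.Random(seed)
    n_extra = int(round(almr * n_samples))
    n_base = n_samples - n_extra
    mixed = rng.sample(extra, n_extra) + rng.sample(base, n_base)
    rng.shuffle(mixed)
    return mixed


@dataclass
class TrialResult:
    almr: float
    lr: float
    score: float  # averaged downstream benchmark score (hypothetical)


def run_proxy_trial(almr, lr):
    """Placeholder for one CPT run on the 8B proxy followed by evaluation.
    In practice this would launch continual pre-training with the given
    data mixture and learning rate, then score the checkpoint on the
    target benchmarks; here it only returns a dummy number."""
    data = mix_corpora(base_corpus, extra_corpus, almr, n_samples=1_000)
    _ = (data, lr)  # training and evaluation would happen here
    return random.random()


# Candidate ALMR and LR values (illustrative, not the paper's grid).
almr_grid = [0.1, 0.2, 0.3, 0.4, 0.5]
lr_grid = [1e-5, 3e-5, 1e-4]

results = [
    TrialResult(almr, lr, run_proxy_trial(almr, lr))
    for almr, lr in itertools.product(almr_grid, lr_grid)
]
best = max(results, key=lambda r: r.score)
print(f"Best proxy setting: ALMR={best.almr}, LR={best.lr}")
```

In a real setup, `run_proxy_trial` would train the 8B proxy on the mixed stream at the given learning rate and evaluate the checkpoint on the target benchmarks; the best (ALMR, LR) pair found at proxy scale would then be reused for the 70B CPT run, as the key points above describe.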