GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

arXiv cs.AI / 4/15/2026


Key Points

  • The paper proposes a framework that reduces LLM training cost by selecting a small, representative subset (coreset) of the training data, making training more efficient.
  • Existing methods fail to track the dynamic changes that occur during training and scale poorly to large LLMs; GRACE addresses both issues with graph-guided, adaptive, and dynamic coreset selection.
  • GRACE combines representation diversity with gradient-based importance metrics, aiming to make the coreset both informative and efficient.
  • To keep the computational cost of frequent updates low, it uses propagation over a k-NN graph, selectively updating scores and embeddings so the coreset adapts to evolving training dynamics.
  • Extensive experiments on three benchmarks show that GRACE improves both training efficiency and downstream task performance across diverse LLMs and tasks.
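The k-NN propagation idea in the fourth point can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `knn_graph`, `propagate_scores`, and the mixing weight `alpha` are all hypothetical names, and the scores stand in for gradient-based importance values that would normally come from the model.

```python
import numpy as np

def knn_graph(embeddings, k):
    """Hypothetical helper: k nearest neighbors by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-neighbors
    return np.argsort(-sim, axis=1)[:, :k]  # indices of k most similar points

def propagate_scores(scores, neighbors, updated_idx, alpha=0.5):
    """Recompute scores only for `updated_idx`, then push each refreshed
    score to its k-NN neighbors as a weighted average, instead of paying
    for a full recomputation over the whole dataset."""
    out = scores.copy()
    for i in updated_idx:
        for j in neighbors[i]:
            out[j] = alpha * out[j] + (1 - alpha) * scores[i]
    return out

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))     # toy embeddings
scores = rng.random(20)            # stand-in for gradient-based importance
nbrs = knn_graph(emb, k=3)
new_scores = propagate_scores(scores, nbrs, updated_idx=[0, 5, 9])
```

The point of the design is that only a sampled subset of examples pays the full scoring cost each round; everything else is refreshed cheaply through the graph.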

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource demands and computational complexity during training, making it challenging to optimize them efficiently on large datasets. To reduce training costs while preserving performance, researchers have investigated coreset selection techniques, which aim to identify small, representative subsets of the entire training dataset to accelerate LLM training. However, existing coreset selection methods fail to adapt to the dynamic nature of LLM training and often struggle with scalability for models of this size. To address these limitations, we propose a graph-guided adaptive and dynamic coreset selection framework for LLMs, namely GRACE. GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a k-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.
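The abstract's combination of representation diversity and gradient-based importance can be illustrated with a simple greedy trade-off. This is a sketch under assumptions, not GRACE's actual selection rule: `select_coreset` and the mixing weight `lam` are invented for illustration, and a farthest-point-style distance term stands in for whatever diversity measure the paper uses.

```python
import numpy as np

def select_coreset(embeddings, importance, budget, lam=0.5):
    """Illustrative greedy selection: at each step pick the point with the
    best blend of importance (stand-in for a gradient-based score) and
    distance to the points already chosen (a simple diversity proxy)."""
    selected = [int(np.argmax(importance))]  # seed with the most important point
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        gain = lam * importance + (1 - lam) * min_dist
        gain[selected] = -np.inf             # never pick the same point twice
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        # distance from each point to its nearest selected point
        min_dist = np.minimum(
            min_dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        )
    return selected

rng = np.random.default_rng(1)
emb = rng.normal(size=(50, 16))   # toy example embeddings
imp = rng.random(50)              # toy importance scores
core = select_coreset(emb, imp, budget=10)
```

With `lam=1.0` this degenerates to top-k by importance; with `lam=0.0` it becomes pure farthest-point sampling, showing how the two criteria trade off.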