Baby Scale: Investigating Models Trained on Individual Children's Language Input

arXiv cs.CL / 4/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study uses children's natural language input (transcripts of BabyView videos) to benchmark how LMs learn and behave when given roughly the amount of data a human child receives, probing the nature of the "data gap."
  • Language models trained on child data show acceptable scaling on grammar tasks, but weaker gains on tasks requiring semantics and world knowledge than models trained on synthetic data.
  • Moreover, performance varies substantially across datasets drawn from individual children's experiences, and linguistic predictors of dataset quality (a combination of distributional and interactional features) matter beyond dataset size.
  • Because a model's likelihood for an individual word correlates with how well children learn that word, the authors conclude that properties of child-directed input may shape both model learning and human language development (see the sketch after this list).
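The word-level analysis in the last point can be illustrated with a minimal sketch: score each target word's log-probability under a causal LM and correlate those scores with a per-word child learning measure (e.g., the proportion of children who produce the word on a CDI-style checklist). This is not the authors' code; the model name, carrier contexts, word list, and `child_knows` values are all placeholders.

```python
# Minimal sketch (assumptions, not the paper's pipeline): correlate a causal
# LM's per-word log-likelihoods with per-word child learning measures.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper trains its own child-scale LMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def mean_word_logprob(word: str, contexts: list[str]) -> float:
    """Average log-probability of `word` continuing each carrier context."""
    scores = []
    for ctx in contexts:
        ids = tokenizer(ctx + " " + word, return_tensors="pt").input_ids
        ctx_len = tokenizer(ctx, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logprobs = model(ids).logits.log_softmax(-1)
        # Logits at position i predict token i+1, so shift by one when
        # summing the log-probs of the tokens that make up the target word.
        word_ids = ids[0, ctx_len:]
        word_logprobs = logprobs[0, ctx_len - 1 : ids.shape[1] - 1]
        scores.append(word_logprobs.gather(-1, word_ids[:, None]).sum().item())
    return sum(scores) / len(scores)

# Hypothetical inputs: target words, neutral carrier contexts, and the
# fraction of children reported to produce each word (e.g. from CDI norms).
words = ["ball", "dog", "because"]
contexts = ["Look at the", "I want the"]
child_knows = [0.92, 0.95, 0.31]

model_scores = [mean_word_logprob(w, contexts) for w in words]
rho, p = spearmanr(model_scores, child_knows)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```

A rank correlation is used here because the two measures live on different scales; the paper's exact statistic and word-learning measure may differ.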

Abstract

Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.
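To make the "linguistic predictors of dataset quality" idea concrete, the sketch below computes a few simple distributional and interactional summary statistics from one child's transcript, of the kind the abstract says are associated with model performance. The feature set, the regex tokenizer, and the downstream regression step are illustrative assumptions, not the paper's feature pipeline.

```python
# Minimal sketch (assumed features, not the paper's): per-child transcript
# statistics that could serve as predictors of dataset quality.
import re
from collections import Counter

def transcript_features(utterances: list[str]) -> dict[str, float]:
    """Distributional/interactional summary statistics for one child's input."""
    tokens = [t for u in utterances for t in re.findall(r"[a-z']+", u.lower())]
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    return {
        "n_tokens": float(n_tokens),
        # Type-token ratio: a crude lexical-diversity measure.
        "ttr": n_types / n_tokens if n_tokens else 0.0,
        # Mean length of utterance in words.
        "mlu_words": n_tokens / len(utterances) if utterances else 0.0,
        # Share of questions, a rough proxy for interactional style.
        "question_rate": (
            sum(u.strip().endswith("?") for u in utterances) / len(utterances)
            if utterances
            else 0.0
        ),
    }

# Hypothetical usage: build one feature dict per child, then regress each
# child's model benchmark score on these features (e.g. with scikit-learn).
example = [
    "Look at the ball!",
    "Do you want the red one?",
    "That's a big dog.",
]
print(transcript_features(example))
```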