In-Place Test-Time Training

arXiv cs.LG / 4/8/2026


Key Points

  • The paper argues that the traditional “train then deploy” approach prevents LLMs from dynamically adapting to new information during real-world usage, motivating Test-Time Training (TTT).
  • It introduces “In-Place Test-Time Training,” which treats the final projection matrix inside each MLP block as fast, adaptable weights, so the method works as a drop-in enhancement for existing LLM architectures without costly retraining from scratch.
  • The authors replace TTT’s generic reconstruction goal with a next-token-prediction-aligned objective tailored to autoregressive language modeling, aiming to fix misalignment issues that hurt practical performance.
  • A chunk-wise update mechanism keeps the method computationally efficient and compatible with context parallelism, which lets it scale.
  • Experiments show that, applied as an in-place enhancement, the method enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k; when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. The authors present the framework as a step toward continual learning in LLMs.
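
The chunk-wise fast-weight idea in the points above can be illustrated with a toy sketch. This is not the authors' algorithm: the summary does not give the paper's objective or update rule, so the squared-error loss, the `targets` vectors, the learning rate `lr`, and the function names below are all assumptions chosen only to show the general pattern (apply the current fast weights to a whole chunk, then update them once per chunk).

```python
def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def chunkwise_fast_weight_pass(W, hiddens, targets, chunk_size, lr):
    """Toy chunk-wise test-time update of one projection matrix.

    W        : fast weights, standing in for an MLP block's final projection
    hiddens  : per-token inputs to that projection
    targets  : per-token regression targets (ASSUMPTION: a placeholder for
               whatever next-token-aligned signal the paper actually uses)
    Returns the per-token outputs and the updated copy of W.
    """
    W = [row[:] for row in W]  # work on a copy; the caller's weights stay intact
    outputs = []
    for start in range(0, len(hiddens), chunk_size):
        chunk_h = hiddens[start:start + chunk_size]
        chunk_t = targets[start:start + chunk_size]

        # 1) Apply: every token in the chunk sees the same fast weights,
        #    so the chunk could be processed in parallel.
        chunk_out = [matvec(W, h) for h in chunk_h]
        outputs.extend(chunk_out)

        # 2) Update: one gradient step on the chunk-mean of
        #    0.5 * ||W h - t||^2 (ASSUMED loss, not the paper's objective).
        #    Errors come from chunk_out, i.e. from the pre-update weights.
        n = len(chunk_h)
        for h, t, o in zip(chunk_h, chunk_t, chunk_out):
            for i in range(len(W)):
                err = o[i] - t[i]
                for j in range(len(h)):
                    W[i][j] -= lr * err * h[j] / n
    return outputs, W
```

In this sketch, tokens in later chunks are processed with weights that have already been adapted on earlier chunks, which is the sense in which the model "learns" from its context at inference time. Updating once per chunk rather than once per token is what keeps the per-chunk computation parallelizable.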

Abstract

The static “train then deploy” paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a “drop-in” enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.