Foundation Models in Robotics: A Comprehensive Review of Methods, Models, Datasets, Challenges and Future Research Directions

arXiv cs.RO / 4/20/2026


Key Points

  • The article reviews how robotics is shifting from fixed, single-task systems toward adaptive, multi-purpose general agents, largely enabled by foundation models (FMs).
  • It surveys the research evolution across five phases, from early NLP/CV integrations to today’s multi-sensory generalization and real-world deployment.
  • The review provides a detailed taxonomy covering foundation-model types (LLMs, VFMs, VLMs, VLAs), neural architectures, learning paradigms, stages of knowledge incorporation, targeted robotic tasks, and application domains.
  • It summarizes publicly available datasets used for training and evaluation, and outlines current open challenges and future research directions for FM-driven robotics.
  • The work aims to offer a holistic, comparative, and critical view of methods, models, datasets, and key gaps in the field.

Abstract

In recent years, the field of robotics has been undergoing a paradigm shift from fixed, single-task, domain-specific solutions towards adaptive, multi-function, general-purpose agents capable of operating in complex, open-world, and dynamic environments. This shift is driven primarily by the emergence of Foundation Models (FMs), i.e., large-scale neural-network architectures trained on massive, heterogeneous datasets, which provide unprecedented capabilities in multi-modal understanding and reasoning, long-horizon planning, and cross-embodiment generalization. In this context, this study provides a holistic, systematic, and in-depth review of the research landscape of FMs in robotics. The evolution of the field is first delineated through five distinct research phases, spanning from the early incorporation of Natural Language Processing (NLP) and Computer Vision (CV) models to the current frontier of multi-sensory generalization and real-world deployment. Subsequently, a highly granular taxonomic investigation of the literature is performed, examining the following key aspects: a) the employed FM types, including LLMs, VFMs, VLMs, and VLAs, b) the underlying neural-network architectures, c) the adopted learning paradigms, d) the different learning stages of knowledge incorporation, e) the major robotic tasks, and f) the main real-world application domains. For each aspect, comparative analysis and critical insights are provided. Moreover, the publicly available datasets used for model training and evaluation across the considered robotic tasks are reported. Finally, a hierarchical discussion of the current open challenges and promising future research directions in the field is presented.