A Multimodal Vision Transformer-based Modeling Framework for Prediction of Fluid Flows in Energy Systems

arXiv cs.AI / 4/6/2026


Key Points

  • The paper proposes a modeling framework built on a hierarchical Vision Transformer (SwinV2-UNet) to predict fluid flows in energy systems, targeting the high cost of conventional CFD for nonlinear, multiscale, multiphysics problems.
  • It uses a multimodal learning setup with auxiliary tokens that encode both the data modality and the time increment, enabling the model to ingest multi-fidelity simulation data.
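  • Experiments focus on high-pressure gas injection relevant to reciprocating engines, with separate models trained on multimodal datasets from in-house CFD simulations of argon jet injection into a nitrogen environment, spanning multiple grid resolutions, turbulence models, and equations of state.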
  • The framework is evaluated on spatiotemporal rollouts (autoregressive future flow prediction) and feature transformation (inferring unobserved fields/views from limited observations).
  • Results indicate the data-driven models can generalize across different grid resolutions and modalities while accurately forecasting flow evolution and reconstructing missing flow-field information.
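
The spatiotemporal rollout task described above can be illustrated with a minimal sketch. It is not the paper's model: `toy_step` is a hypothetical stand-in for a trained one-step predictor (here, explicit diffusion on a periodic grid), and the grid size, `dt`, and step count are arbitrary assumptions. Only the feedback loop in `rollout` reflects the autoregressive idea itself.

```python
import numpy as np

def toy_step(state, dt):
    """Hypothetical stand-in for a trained one-step model.

    Applies one explicit diffusion update on a periodic 2D grid.
    """
    laplacian = (
        np.roll(state, 1, axis=0) + np.roll(state, -1, axis=0)
        + np.roll(state, 1, axis=1) + np.roll(state, -1, axis=1)
        - 4.0 * state
    )
    return state + dt * laplacian

def rollout(step_fn, initial_state, n_steps, dt):
    """Autoregressive rollout: each prediction becomes the next input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step_fn(states[-1], dt))
    return np.stack(states)  # shape: (n_steps + 1, H, W)

# Assumed toy initial condition: a single "injected" cell on a 16x16 grid.
field = np.zeros((16, 16))
field[8, 8] = 1.0

trajectory = rollout(toy_step, field, n_steps=5, dt=0.1)
print(trajectory.shape)  # (6, 16, 16)
```

Because every step consumes the previous prediction rather than ground truth, errors can accumulate over the horizon, which is why rollout accuracy is usually assessed separately from single-step accuracy.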

Abstract

Computational fluid dynamics (CFD) simulations of complex fluid flows in energy systems are prohibitively expensive due to strong nonlinearities and multiscale-multiphysics interactions. In this work, we present a transformer-based modeling framework for prediction of fluid flows, and demonstrate it for high-pressure gas injection phenomena relevant to reciprocating engines. The approach employs a hierarchical Vision Transformer (SwinV2-UNet) architecture that processes multimodal flow datasets from multi-fidelity simulations. The model architecture is conditioned on auxiliary tokens explicitly encoding the data modality and time increment. Model performance is assessed on two different tasks: (1) spatiotemporal rollouts, where the model autoregressively predicts the flow state at future times; and (2) feature transformation, where the model infers unobserved fields/views from observed fields/views. We train separate models on multimodal datasets generated from in-house CFD simulations of argon jet injection into a nitrogen environment, encompassing multiple grid resolutions, turbulence models, and equations of state. The resulting data-driven models learn to generalize across resolutions and modalities, accurately forecasting the flow evolution and reconstructing missing flow-field information from limited views. This work demonstrates how large vision transformer-based models can be adapted to advance predictive modeling of complex fluid flow systems.
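
One way to picture the auxiliary-token conditioning is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the tokens enter the SwinV2-UNet, so prepending them to the patch tokens is an assumption, and `EMBED_DIM`, `PATCH`, `N_MODALITIES`, `patchify`, and `build_tokens` are hypothetical names. Random matrices stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8      # assumed token width
PATCH = 4          # assumed patch size
N_MODALITIES = 3   # assumed number of data modalities

# Random matrices stand in for learned parameters.
patch_proj = rng.normal(size=(PATCH * PATCH, EMBED_DIM))
modality_table = rng.normal(size=(N_MODALITIES, EMBED_DIM))
dt_proj = rng.normal(size=(1, EMBED_DIM))

def patchify(field, patch):
    """Split an (H, W) field into flattened, non-overlapping patches."""
    h, w = field.shape
    blocks = field.reshape(h // patch, patch, w // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def build_tokens(field, modality_id, dt):
    """Return the token sequence [modality token, time token, patch tokens]."""
    patch_tokens = patchify(field, PATCH) @ patch_proj    # (n_patches, EMBED_DIM)
    modality_token = modality_table[modality_id][None, :]  # (1, EMBED_DIM)
    dt_token = np.array([[dt]]) @ dt_proj                  # (1, EMBED_DIM)
    return np.concatenate([modality_token, dt_token, patch_tokens], axis=0)

field = rng.normal(size=(16, 16))
tokens = build_tokens(field, modality_id=1, dt=0.5)
print(tokens.shape)  # (18, 8): 2 auxiliary tokens + 16 patch tokens
```

In this sketch the same flow field yields a different token sequence when `modality_id` or `dt` changes, which is the sense in which one set of weights could be conditioned on data source and time increment.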
