AI Navigate

Taming Vision Priors for Data Efficient mmWave Channel Modeling

arXiv cs.CV / 3/17/2026


Key Points

  • VisRFTwin is a scalable, data-efficient digital-twin framework that fuses vision-derived priors with differentiable ray tracing for mmWave channel modeling.
  • It uses multi-view images from commodity cameras processed by a frozen Vision-Language Model to derive semantic embeddings and convert them into initial estimates of permittivity and conductivity for scene surfaces.
  • A Sionna-based differentiable ray tracer is calibrated via gradient descent using only a few dozen sparse channel soundings, dramatically reducing data requirements.
  • The system retains the vision-to-material parameter associations to enable fast transfer to new scenarios without re-calibration.
  • Empirical evaluations across office interiors, urban canyons, and dynamic public spaces show up to 10x reduction in channel measurements and a 59% lower median delay spread error compared with pure data-driven deep learning methods.
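The vision-to-material step in the key points above can be sketched as a nearest-prototype lookup: a frozen VLM embeds each scene surface, and the closest material prototype supplies initial permittivity and conductivity. Everything below is an illustrative assumption, not the paper's actual pipeline — the function names, the material table (rough textbook-style values), and the random vectors standing in for real VLM embeddings are all hypothetical.

```python
import numpy as np

# Hypothetical material table: relative permittivity and conductivity (S/m).
# Values are rough illustrative numbers, not taken from the paper.
MATERIAL_TABLE = {
    "concrete": (5.3, 0.26),
    "glass":    (6.3, 0.15),
    "wood":     (2.0, 0.05),
}

def material_prior(surface_emb, prototype_embs):
    """Return the best-matching material name and its (eps_r, sigma) prior,
    chosen by cosine similarity between a surface embedding and per-material
    prototype embeddings (a stand-in for the VLM-based association)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best = max(prototype_embs, key=lambda m: cos(surface_emb, prototype_embs[m]))
    return best, MATERIAL_TABLE[best]

# Toy embeddings standing in for frozen VLM outputs.
rng = np.random.default_rng(0)
protos = {m: rng.normal(size=8) for m in MATERIAL_TABLE}
surface = protos["concrete"] + 0.05 * rng.normal(size=8)  # "looks like" concrete
name, (eps_r, sigma) = material_prior(surface, protos)
```

Because the mapping is just a retained association between vision features and material parameters, applying it to a new scene needs no re-calibration — which is the transfer property the key points describe.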

Abstract

Accurately modeling millimeter-wave (mmWave) propagation is essential for real-time AR and autonomous systems. Differentiable ray tracing offers a physics-grounded solution but still faces deployment challenges due to its over-reliance on exhaustive channel measurements or brittle, hand-tuned scene models for material properties. We present VisRFTwin, a scalable and data-efficient digital-twin framework that integrates vision-derived material priors with differentiable ray tracing. Multi-view images from commodity cameras are processed by a frozen Vision-Language Model to extract dense semantic embeddings, which are translated into initial estimates of permittivity and conductivity for scene surfaces. These priors initialize a Sionna-based differentiable ray tracer, which rapidly calibrates material parameters via gradient descent with only a few dozen sparse channel soundings. Once calibrated, the association between vision features and material parameters is retained, enabling fast transfer to new scenarios without repeated calibration. Evaluations across three real-world scenarios (office interiors, urban canyons, and dynamic public spaces) show that VisRFTwin reduces channel measurement needs by up to 10× while achieving a 59% lower median delay spread error than purely data-driven deep learning methods.
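The calibration step the abstract describes — gradient descent on material parameters against sparse channel soundings — can be sketched with a deliberately tiny stand-in model. This is not the Sionna API: the single normal-incidence reflection path, the lossless-dielectric reflection formula, and the finite-difference gradient (standing in for the ray tracer's automatic differentiation) are all simplifying assumptions.

```python
import numpy as np

def reflection_gain(eps_r):
    """Power reflection coefficient of a lossless dielectric at normal
    incidence: |(1 - sqrt(eps_r)) / (1 + sqrt(eps_r))|^2."""
    r = (1.0 - np.sqrt(eps_r)) / (1.0 + np.sqrt(eps_r))
    return r ** 2

def calibrate(measured_gain, eps_init, lr=50.0, steps=200, h=1e-4):
    """Fit permittivity by gradient descent on squared error against one
    'sounding'; a central finite difference approximates the gradient."""
    loss = lambda e: (reflection_gain(e) - measured_gain) ** 2
    eps = eps_init
    for _ in range(steps):
        grad = (loss(eps + h) - loss(eps - h)) / (2 * h)
        eps = max(1.0, eps - lr * grad)  # keep permittivity physical (>= 1)
    return eps

true_eps = 5.3                          # e.g. a concrete-like wall
sounding = reflection_gain(true_eps)    # one noiseless "channel sounding"
eps_hat = calibrate(sounding, eps_init=2.0)  # vision prior as the start point
```

The vision prior matters here only as the initialization (`eps_init`): a good prior puts the optimizer in the right basin, which is how the framework gets away with only a few dozen soundings instead of exhaustive measurement campaigns.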