Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

arXiv cs.CV / 4/24/2026

Key Points

  • The paper introduces a framework that uses multimodal LLMs with Google Street View imagery to automatically assess building conditions across the United States.
  • Fine-tuning Gemma 3 27B on a relatively small human-labeled dataset yields strong agreement with human mean opinion scores, surpassing individual raters on SRCC and PLCC versus the MOS benchmark.
  • To reduce latency and cost, the authors distill the fine-tuned Gemma 3 27B into a Gemma 3 4B model that runs roughly 3x faster while maintaining comparable accuracy.
  • They further distill the model into CNN- and transformer-based variants (EfficientNetV2-M and SwinV2-B), achieving near-original performance with about a 30x speed gain.
  • The work also evaluates LLMs on a broad set of built-environment and housing attributes via a human-AI alignment study and provides a visualization dashboard to support homeowners and downstream analysis.
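The agreement metrics named above can be illustrated with a minimal sketch. SRCC is the Spearman rank correlation and PLCC is the Pearson linear correlation, each computed here between a set of scores and the mean opinion score (MOS) per image. The ratings, the model scores, and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical and chosen only for illustration; the summary does not say how the paper handles ties or whether a rater is excluded from the MOS when being compared against it (this sketch keeps the rater in, for simplicity).

```python
from statistics import mean


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: rows are human raters, columns are images.
ratings = [
    [1, 2, 3, 4, 5],
    [2, 2, 4, 4, 5],
    [1, 3, 3, 5, 4],
]
mos = [mean(column) for column in zip(*ratings)]  # mean opinion score per image
model_scores = [1.2, 2.1, 3.4, 4.2, 4.9]          # made-up model outputs

print("model   SRCC/PLCC:", spearman(model_scores, mos), pearson(model_scores, mos))
for index, rater in enumerate(ratings):
    print(f"rater {index} SRCC/PLCC:", spearman(rater, mos), pearson(rater, mos))
```

Comparing the model's row against each rater's row is the kind of comparison behind the claim that the fine-tuned model surpasses individual raters against the MOS benchmark.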

Abstract

We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. We further distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer-based model (SwinV2-B), delivering near-original performance with a 30x speed gain. We also investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study, and we develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
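The distillation step can be pictured with a deliberately tiny sketch. It assumes one common form of distillation for scalar scores: the large model (the teacher) scores unlabeled inputs, and a small model (the student) is fit to those scores as pseudo-labels, so no additional human labels are needed. The abstract does not specify the loss or training data the authors use, so this is an assumption. Everything here is a hypothetical stand-in: `teacher_score` replaces the fine-tuned LLM, a single number replaces an image, and a linear fit replaces the student network.

```python
import random


def teacher_score(feature):
    """Hypothetical stand-in for the large teacher: feature in [0, 1] -> score in [1, 5]."""
    return 1.0 + 4.0 * feature


def fit_linear(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rng = random.Random(0)

# Unlabeled inputs: no human ratings are involved in training the student.
unlabeled = [rng.random() for _ in range(200)]

# The teacher produces pseudo-labels for the unlabeled inputs.
pseudo_labels = [teacher_score(x) for x in unlabeled]

# The student is trained to reproduce the teacher's outputs.
slope, intercept = fit_linear(unlabeled, pseudo_labels)


def student_score(feature):
    """Small, cheap model that imitates the teacher."""
    return slope * feature + intercept


# Check how closely the student tracks the teacher on fresh inputs.
held_out = [rng.random() for _ in range(50)]
max_gap = max(abs(student_score(x) - teacher_score(x)) for x in held_out)
print("largest teacher/student gap on held-out inputs:", max_gap)
```

In this toy setting the student matches the teacher almost exactly because both are linear. With real models the student is less expressive than the teacher, which is why the abstract reports comparable or near-original performance rather than identical results.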