LLM-Powered Flood Depth Estimation from Social Media Imagery: A Vision-Language Model Framework with Mechanistic Interpretability for Transportation Resilience

arXiv cs.CV / 3/19/2026

Key Points

  • FloodLlama is a fine-tuned open-source vision-language model for real-time, centimeter-resolution flood-depth estimation from single street-level images, supported by a multimodal TikTok data pipeline.
  • The model was trained on a synthetic dataset of about 190,000 images spanning seven vehicle types, four weather conditions, and 41 depth levels (0-40 cm at 1 cm resolution) using progressive curriculum learning and QLoRA to fine-tune LLaMA 3.2-11B Vision.
  • Evaluation across 34,797 trials shows depth-dependent prompt effects, with simple prompts excelling at shallow depths and chain-of-thought reasoning improving performance at greater depths; MAE is below 0.97 cm and Acc@5cm exceeds 93.7% for deep flooding.
  • A five-phase mechanistic interpretability framework identifies layer L23 as the critical depth-encoding transition and enables selective fine-tuning that reduces trainable parameters by 76-80% while maintaining accuracy.
  • The Tier 3 configuration achieves 98.62% real-world accuracy and remains robust under visual occlusion; validation on 676 annotated flood frames from Detroit demonstrates the feasibility of real-time, crowd-sourced flood sensing.
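The paper does not publish training code, but the combination of QLoRA with selective fine-tuning above the L23 depth-encoding transition can be sketched with the Hugging Face peft library. This is a minimal configuration sketch under assumptions: the rank, alpha, target modules, and layer count are illustrative choices, not values from the paper.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization backbone for QLoRA
# (quantization settings are illustrative, not from the paper)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters restricted to layers at and above the identified
# depth-encoding transition (L23) -- the kind of selective targeting
# that could yield the reported 76-80% reduction in trainable parameters.
lora_config = LoraConfig(
    r=16,                                     # assumed adapter rank
    lora_alpha=32,                            # assumed scaling
    target_modules=["q_proj", "v_proj"],      # assumed projections
    layers_to_transform=list(range(23, 40)),  # layer count assumed
    task_type="CAUSAL_LM",
)
```

Restricting `layers_to_transform` is what makes the fine-tuning "selective": adapters are attached only where the interpretability analysis locates depth encoding, rather than across the whole model.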

Abstract

Urban flooding poses an escalating threat to transportation network continuity, yet no operational system currently provides real-time, street-level flood depth information at the centimeter resolution required for dynamic routing, electric vehicle (EV) safety, and autonomous vehicle (AV) operations. This study presents FloodLlama, a fine-tuned open-source vision-language model (VLM) for continuous flood depth estimation from single street-level images, supported by a multimodal sensing pipeline using TikTok data. A synthetic dataset of approximately 190,000 images was generated, covering seven vehicle types, four weather conditions, and 41 depth levels (0-40 cm at 1 cm resolution). Progressive curriculum training enabled coarse-to-fine learning, while LLaMA 3.2-11B Vision was fine-tuned using QLoRA. Evaluation across 34,797 trials reveals a depth-dependent prompt effect: simple prompts perform better for shallow flooding, whereas chain-of-thought (CoT) reasoning improves performance at greater depths. FloodLlama achieves a mean absolute error (MAE) below 0.97 cm and Acc@5cm above 93.7% for deep flooding, exceeding 96.8% for shallow depths. A five-phase mechanistic interpretability framework identifies layer L23 as the critical depth-encoding transition and enables selective fine-tuning that reduces trainable parameters by 76-80% while maintaining accuracy. The Tier 3 configuration achieves 98.62% accuracy on real-world data and shows strong robustness under visual occlusion. A TikTok-based data pipeline, validated on 676 annotated flood frames from Detroit, demonstrates the feasibility of real-time, crowd-sourced flood sensing. The proposed framework provides a scalable, infrastructure-free solution with direct implications for EV safety, AV deployment, and resilient transportation management.
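The abstract's headline numbers rest on two standard depth-estimation metrics, MAE and Acc@5cm (the fraction of predictions within 5 cm of ground truth). As a minimal sketch of how such metrics are typically computed (the function name and example values are illustrative, not from the paper):

```python
import numpy as np

def depth_metrics(pred_cm, true_cm, tol_cm=5.0):
    """Mean absolute error and Acc@tol for flood-depth estimates in cm."""
    pred = np.asarray(pred_cm, dtype=float)
    true = np.asarray(true_cm, dtype=float)
    err = np.abs(pred - true)
    mae = err.mean()              # average absolute error in cm
    acc = (err <= tol_cm).mean()  # fraction within the tolerance band
    return mae, acc

# Illustrative predictions vs. ground truth on four frames
mae, acc5 = depth_metrics([12, 30, 4, 22], [10, 33, 4, 28])
print(round(mae, 2), acc5)  # 2.75 0.75
```

Under this definition, the reported Acc@5cm above 93.7% for deep flooding means that fewer than 6.3% of deep-water estimates miss the true depth by more than 5 cm.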