How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits

arXiv cs.CV / 4/27/2026


Key Points

  • Street-view perception models can predict subjective attributes such as safety at scale, but they remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a given scene.
  • It proposes a “lever-based” interventional counterfactual framework that turns scene-level explainability into a constrained search over structured, localized counterfactual edits.
  • Each lever specifies a semantic concept, spatial support, an intervention direction, and a constrained edit template. Candidate edits are generated via prompt-conditioned image editing and retained only if they pass validity checks for same-place preservation, locality, realism, and plausibility.
  • In a pilot across 50 scenes from five cities, the method surfaces preliminary, proxy-based directional patterns and a practical failure taxonomy for prompt-only editing; Mobility Infrastructure and Physical Maintenance show the largest auxiliary safety shifts.
  • Human pairwise judgements remain the ground-truth endpoint; validating the counterfactual explanations against them is left to future work.

Abstract

Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.
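
The lever structure and validity filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, field types, template string, and the boolean representation of check outcomes are all assumptions made for illustration. Only the four lever components and the four check names come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four validity checks named in the abstract.
VALIDITY_CHECKS = ("same_place", "locality", "realism", "plausibility")


@dataclass(frozen=True)
class Lever:
    """Hypothetical container for the four lever components."""
    concept: str          # semantic concept, e.g. "Physical Maintenance"
    spatial_support: str  # region the edit may touch (illustrative label)
    direction: str        # intervention direction, e.g. "increase"
    edit_template: str    # constrained prompt template (assumed format)

    def prompt(self) -> str:
        # Fill the template to obtain the text passed to an image editor.
        return self.edit_template.format(
            concept=self.concept,
            region=self.spatial_support,
            direction=self.direction,
        )


@dataclass
class CandidateEdit:
    lever: Lever
    checks: Dict[str, bool]  # outcome of each check, assumed precomputed


def is_valid(edit: CandidateEdit) -> bool:
    # An edit is retained only if every check passes; a missing check fails.
    return all(edit.checks.get(name, False) for name in VALIDITY_CHECKS)


def retain_valid(candidates: List[CandidateEdit]) -> List[CandidateEdit]:
    return [c for c in candidates if is_valid(c)]


# Illustrative usage with made-up check outcomes.
lever = Lever(
    concept="Physical Maintenance",
    spatial_support="facade",
    direction="increase",
    edit_template="{direction} {concept} in the {region}; keep everything else unchanged",
)
passing = CandidateEdit(lever, {name: True for name in VALIDITY_CHECKS})
failing = CandidateEdit(lever, {**passing.checks, "locality": False})
kept = retain_valid([passing, failing])
```

Treating the checks as a conjunction mirrors the abstract's wording that edits are "retained only if" they satisfy all four; how each check is actually computed is not specified in this summary and is left out of the sketch.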