SketchVLM: Vision language models can annotate images to explain thoughts and guide users

arXiv cs.CV / April 28, 2026


Key Points

  • SketchVLM is a training-free, model-agnostic framework that lets vision-language models annotate input images with editable SVG overlays instead of only returning text explanations.
  • The approach is designed to be non-destructive and verifiable, producing visual reasoning artifacts such as labels, connections, and shape sketches that overlay the original image.
  • Experiments on seven benchmarks show up to +28.5 percentage points improvement in visual reasoning accuracy and up to 1.48× better annotation quality versus image-editing and fine-tuned sketching baselines.
  • The generated annotations are reported to be more faithful to the model’s stated answers, with strong results achievable in single-turn generation and additional benefits from multi-turn interaction.
  • An interactive demo and code are provided at the project site to enable users to try the method and reproduce results.

Abstract

When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
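To make the overlay idea concrete, the sketch below shows how a non-destructive SVG annotation layer might be composed over an image: the original pixels are untouched, and each annotation (label, connector line, shape outline) is an editable SVG element. The function name and the element choices here are illustrative assumptions, not SketchVLM's actual output format.

```python
# Hypothetical illustration of a non-destructive SVG overlay, in the spirit
# of SketchVLM's annotations. The helper name and element layout are
# assumptions for this sketch, not the framework's real interface.

def make_overlay(width: int, height: int, annotations: list[str]) -> str:
    """Compose an SVG overlay sized to match the input image.

    `annotations` is a list of SVG fragment strings (labels, connector
    lines, shape outlines). Because the overlay is a separate layer,
    the underlying image pixels are never modified, and each element
    remains individually editable.
    """
    body = "\n  ".join(annotations)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">\n  {body}\n</svg>'
    )

# Example: label an object, draw a connector to it, and outline it.
overlay = make_overlay(640, 480, [
    '<text x="40" y="30" fill="red">ball</text>',                    # label
    '<line x1="60" y1="35" x2="200" y2="150" stroke="red"/>',        # connector
    '<circle cx="210" cy="160" r="25" fill="none" stroke="red"/>',   # shape outline
])
print(overlay)
```

Rendering such an overlay on top of the source image (for example, by stacking both in an HTML page) would let a user inspect, move, or delete individual annotations, which is what makes the artifacts verifiable rather than baked into the pixels.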