Semantic video search using local Qwen3-VL embedding, no API, no transcription

Reddit r/LocalLLaMA / 3/31/2026

Key Points

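Introduces a method for "semantic video search": Qwen3-VL-Embedding embeds video directly into vectors, with no transcription or frame captioning, and matches it against natural-language queries.
The 8B model runs in about 18GB of RAM and the 2B model in about 6GB; the author reports practically usable search results running locally on Apple Silicon (MPS) and CUDA.
The author built a CLI tool, SentrySearch, that indexes footage in ChromaDB, searches it, and automatically trims the matching clip.
The tool was originally built on a cloud embedding API (Gemini); the local Qwen backend was added in response to user requests.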

I've been experimenting with Qwen3-VL-Embedding for native video search, embedding raw video directly into a vector space alongside text queries. No transcription, no frame captioning, no intermediate text. You just search with natural language and it matches against video clips.
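A minimal sketch of the matching step, assuming the model has already turned each clip and the text query into vectors in the same space. The three-dimensional vectors and the `rank_clips` helper are made up for illustration; real embeddings are much larger and come from the model, not from hand-written lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_clips(query_vec, clip_vecs):
    """Return (clip_id, score) pairs, best match first. Hypothetical helper."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in clip_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for clip embeddings (assumption: not real model output).
clip_vecs = {
    "clip_000": [0.9, 0.1, 0.0],
    "clip_001": [0.1, 0.8, 0.3],
    "clip_002": [0.0, 0.2, 0.9],
}
# Toy stand-in for the embedded text query.
query_vec = [0.2, 0.9, 0.2]

ranking = rank_clips(query_vec, clip_vecs)
best_clip, best_score = ranking[0]
print(best_clip, round(best_score, 3))
```

Because text and video land in one space, search reduces to nearest-neighbor lookup over the clip vectors, which is the job a vector store takes over at scale.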

The surprising part: the 8B model produces genuinely usable results running fully locally. Tested on Apple Silicon (MPS) and CUDA. The 8B model needs ~18GB RAM; the 2B runs on ~6GB.
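A rough sanity check on those numbers, assuming the weights are loaded at 16-bit precision (2 bytes per parameter) and using the nominal 8B and 2B parameter counts. The post does not state the precision, so this is a back-of-the-envelope estimate and not a measurement.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 is an assumption (fp16/bf16); the post does not say
    which precision the local backend uses.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9

# Compare the weight-only estimate with the RAM figures quoted in the post.
for name, params_b, quoted_gb in [("8B", 8, 18), ("2B", 2, 6)]:
    weights = weight_memory_gb(params_b)
    remainder = quoted_gb - weights
    print(f"{name}: ~{weights:.0f} GB weights, ~{remainder:.0f} GB for everything else")
```

Under that assumption, both models leave roughly 2GB beyond the weights, which would plausibly go to activations, decoded video frames, and runtime overhead.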

I built a CLI tool around this (SentrySearch) that indexes footage into ChromaDB, searches it, and auto-trims the matching clip. It was originally built on Gemini's embedding API, but I added the local Qwen backend after a lot of people asked for it.
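The index-then-trim flow can be sketched as follows. This is not SentrySearch's actual code: the chunk length, the overlap, and the use of ffmpeg for trimming are all assumptions made for illustration, and `chunk_windows` and `ffmpeg_trim_command` are hypothetical helpers.

```python
def chunk_windows(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Split a video of duration_s seconds into overlapping (start, end) windows.

    Each window would be embedded and stored as one searchable clip.
    The 30 s / 5 s defaults are illustrative, not taken from the tool.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    step = chunk_s - overlap_s
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows

def ffmpeg_trim_command(src, start_s, end_s, dst):
    """Build an ffmpeg argument list that cuts [start_s, end_s] out of src.

    Uses stream copy (-c copy), which is fast but cuts on keyframes.
    """
    return [
        "ffmpeg", "-y",
        "-ss", f"{start_s:.3f}",
        "-i", src,
        "-t", f"{end_s - start_s:.3f}",
        "-c", "copy",
        dst,
    ]

windows = chunk_windows(70.0)
# Pretend the search step picked the second window as the best match.
start, end = windows[1]
cmd = ffmpeg_trim_command("footage.mp4", start, end, "match.mp4")
print(windows)
print(" ".join(cmd))
```

The command list could be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Overlapping windows are one way to avoid splitting an event across a clip boundary, at the cost of more vectors to store.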

Has anyone else been using Qwen3-VL-Embedding for video tasks? Curious how others are finding the quality vs the cloud embedding models.

(Demo video attached. Note that this was recorded using the Gemini backend, but the local backend works the same way with the --backend local flag.)

submitted by /u/Vegetable_File758