It's not just a text-prompt wrapper, though. I benchmarked 168 combinations (7 detectors × 3 trackers × 4 skip rates × 2 resolutions) on 4K footage.
The text-prompted models (GDINO, Florence-2) are slower (~2 fps), but the flexibility is worth it: you don't need to retrain anything, just describe what you want gone. How it works locally:
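A minimal sketch of what that local text-prompted loop could look like. Here `detect_by_text` is a hypothetical stand-in for the Grounding DINO call (not the repo's actual API), and the "blur" is a crude mean-fill rather than a real Gaussian blur:

```python
import numpy as np

def detect_by_text(frame, prompt):
    """Hypothetical stand-in for a text-prompted detector such as
    Grounding DINO: returns (x0, y0, x1, y1) boxes matching the prompt.
    A real integration would run the model here."""
    return [(8, 8, 24, 24)]  # fixed box, for illustration only

def blur_region(frame, box):
    """Crudely anonymize a box by replacing it with its mean color
    (a stand-in for Gaussian blur / pixelation)."""
    x0, y0, x1, y1 = box
    region = frame[y0:y1, x0:x1]
    mean_color = region.mean(axis=(0, 1), keepdims=True)
    frame[y0:y1, x0:x1] = mean_color.astype(frame.dtype)
    return frame

# Describe what you want gone; no retraining involved.
frame = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
for box in detect_by_text(frame, "license plates"):
    frame = blur_region(frame, box)
```

The point of the design is that the prompt string is the only "configuration": swapping "license plates" for "faces" or "tattoos" changes what gets blurred without touching any weights.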
Other stuff:
Runs entirely locally. The repo has GIFs comparing all the model approaches side by side on the same 4K frame. Curious what text prompts people would want to use for anonymization; the Grounding DINO integration can detect basically anything you can describe. But user preferences differ, so what would the main use cases be? And would it help if I hosted this as a website, like Photopea? Is there demand for that?
local natural-language-based video blurring/anonymization tool, runs on 4K at 76 fps
Reddit r/LocalLLaMA / 4/2/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The article benchmarks a locally running, natural-language-driven video anonymization tool and reports that one configuration (the RF-DETR Nano detector with skip=4) can reach 76 fps at 4K.
- It finds a clear speed-versus-flexibility tradeoff: text-prompted grounding models like Grounding DINO and Florence-2 run at roughly 2 fps but let users describe exactly what to blur without retraining.
- The system combines zero-shot detectors with tracking (ByteTrack) and skip-frame processing to maintain quality while reducing how often heavy detection runs, enabling real-time performance for some models.
- It supports multiple anonymization approaches beyond bounding boxes, including instance segmentation masks (pixel-precise blurring/pixelation) and customizable blur shapes (e.g., lasso, polygon, star).
- The tool includes multiple user interfaces (Flask web UI, a browser-based demo, and a studio/editor-style workflow) and adds additional capabilities like 360° equirectangular video support.
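The detector + tracker + skip-frame combination from the key points above can be sketched as a simple scheduler: run the expensive detector only on every `skip`-th frame and carry boxes forward in between. This is an assumption-laden simplification; a real tracker like ByteTrack would match and update boxes by IoU and motion rather than just reusing them:

```python
from dataclasses import dataclass

@dataclass
class Track:
    box: tuple  # last known (x0, y0, x1, y1)

def process_video(frames, detect, skip=4):
    """Skip-frame pipeline sketch: the heavy detection pass runs on
    every `skip`-th frame; intermediate frames reuse the tracked boxes,
    which is what makes real-time 4K throughput plausible."""
    tracks, out = [], []
    for i, frame in enumerate(frames):
        if i % skip == 0:  # heavy detection pass
            tracks = [Track(box=b) for b in detect(frame)]
        # every frame still gets boxes to blur, detector ran or not
        out.append([t.box for t in tracks])
    return out

# Usage with a mock detector that records how often it is called.
calls = []
def mock_detect(frame):
    calls.append(frame)
    return [(0, 0, 10, 10)]

boxes_per_frame = process_video(list(range(10)), mock_detect, skip=4)
```

With `skip=4` over 10 frames the detector fires only on frames 0, 4, and 8, yet all 10 frames receive blur boxes; that 4× reduction in detector invocations is where most of the reported speedup comes from.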
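The pixel-precise, mask-based anonymization mentioned above can be illustrated with a small NumPy-only sketch (an assumption of how mask-scoped pixelation might work, not the tool's actual code): pixelate the whole frame into blocks, then copy the blocky version back only where the instance mask is set, leaving the background untouched.

```python
import numpy as np

def pixelate_masked(frame, mask, block=8):
    """Pixelate only the pixels selected by a boolean instance mask.
    Unlike a bounding-box blur, background pixels inside the box but
    outside the mask are left untouched (pixel-precise)."""
    h, w, _ = frame.shape
    blocky = frame.copy()
    for y in range(0, h, block):
        for x in range(0, w, block):
            cell = frame[y:y + block, x:x + block]
            blocky[y:y + block, x:x + block] = (
                cell.mean(axis=(0, 1)).astype(frame.dtype)
            )
    out = frame.copy()
    out[mask] = blocky[mask]  # apply pixelation only inside the mask
    return out

# Usage: pixelate a rectangular mask region of a random frame.
frame = np.random.randint(0, 255, (32, 32, 3), dtype=np.uint8)
mask = np.zeros((32, 32), dtype=bool)
mask[4:20, 4:20] = True
out = pixelate_masked(frame, mask, block=8)
```

The same mask-scoped copy would work for the custom blur shapes (lasso, polygon, star) the post mentions: any shape can be rasterized into the boolean mask.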