local natural language based video blurring/anonymization tool runs on 4K at 76 fps

Reddit r/LocalLLaMA / 4/2/2026


Key Points

  • The article benchmarks a locally running, natural-language-driven video anonymization tool and reports that one configuration (RF-DETR Nano Det with a skip=4 setting) can reach 76 fps at 4K.
  • It finds a clear speed-versus-flexibility tradeoff: text-prompted grounding models like Grounding DINO and Florence-2 run at about ~2 fps but allow users to describe exactly what to blur without retraining.
  • The system combines zero-shot detectors with tracking (ByteTrack) and skip-frame processing to maintain quality while reducing how often heavy detection runs, enabling real-time performance for some models.
  • It supports multiple anonymization approaches beyond bounding boxes, including instance segmentation masks (pixel-precise blurring/pixelation) and customizable blur shapes (e.g., lasso, polygon, star).
  • The tool includes multiple user interfaces (Flask web UI, a browser-based demo, and a studio/editor-style workflow) and adds additional capabilities like 360° equirectangular video support.

It's not just a text-prompt wrapper though. I benchmarked 168 combinations (7 detectors × 3 trackers × 4 skip rates × 2 resolutions) on 4K footage:

| Model | Effective FPS on 4K | What it does |
| --- | --- | --- |
| RF-DETR Nano Det + skip=4 | 76 fps | Auto-detect faces/people, real-time on 4K |
| RF-DETR Med Seg + skip=2 | 9 fps | Pixel-precise instance segmentation masks |
| Grounding DINO | ~2 fps | Text-prompted — describe what to blur |
| Florence-2 | ~2 fps | Visual grounding with natural language |
| SAM2 | varies | Click or draw a box to select what to blur |

The text-prompted models (GDINO, Florence-2) are slower (~2 fps) but the flexibility is worth it — you don't need to retrain anything, just describe what you want gone.

How it works locally:

  • Grounding DINO takes your text prompt → runs zero-shot detection on each frame → ByteTrack tracks detections across frames → blur/pixelate applied with custom shapes
  • Skip-frame tracking: run detection every Nth frame, tracker interpolates the rest. Skip=4 → 4× speedup with no visible quality loss
  • All weights download automatically on first run, everything stays local
  • Browser UI (Flask) — upload video, type your prompt, process, download
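The skip-frame idea above can be sketched in a few lines. This is a hedged illustration, not the repo's actual code: `detect` is a stub standing in for a heavy zero-shot detector like Grounding DINO, and a real tracker (ByteTrack) would refine the carried-over boxes rather than reuse them verbatim.

```python
# Sketch of skip-frame detection: run the expensive detector only on every
# Nth frame; a lightweight tracker carries boxes across the frames between.

def detect(frame_idx):
    """Stub detector: pretend one tracked box drifts right 2 px per frame."""
    x = 100 + 2 * frame_idx
    return [(x, 50, x + 40, 120)]  # one (x1, y1, x2, y2) box

def skip_frame_boxes(num_frames, skip=4):
    """Detect on frames 0, skip, 2*skip, ...; reuse the last boxes otherwise."""
    boxes_per_frame = []
    last = []
    for i in range(num_frames):
        if i % skip == 0:
            last = detect(i)          # heavy model call, only 1/skip of frames
        boxes_per_frame.append(last)  # a tracker would interpolate these
    return boxes_per_frame

boxes = skip_frame_boxes(8, skip=4)
# Detection ran only on frames 0 and 4 — roughly a 4x cut in detector calls.
```

With skip=4, the detector fires on a quarter of the frames, which is where the headline 4× speedup comes from; the tracker hides the gap visually.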

Other stuff:

  • 8 total detection models (RF-DETR, YOLO, Grounding DINO, Florence-2, SAM2, MediaPipe, Cascade)
  • 360° equirectangular video support (Insta360 X5 / GoPro Max up to 8K)
  • Custom blur shapes — lasso, polygon, star, circle drawn on detected bounding boxes
  • Instance segmentation for pixel-precise masks, not just bounding boxes
  • 3 interfaces: full studio editor, simple upload-and-process, real-time MJPEG streaming demo
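For a feel of the anonymization step itself, here is a minimal pixelation sketch over a detected bounding box, using only NumPy. It is an illustration of the general technique, not the repo's implementation; the custom shapes (lasso, star, etc.) would layer a boolean mask on top of this same block-averaging.

```python
import numpy as np

def pixelate_region(frame, box, block=8):
    """Pixelate inside box=(x1, y1, x2, y2) by averaging block x block cells.

    A shape mask (polygon, star, ...) would restrict which pixels inside the
    box get overwritten; here the whole box is pixelated for simplicity.
    """
    x1, y1, x2, y2 = box
    roi = frame[y1:y2, x1:x2].astype(float)
    h, w = roi.shape[:2]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            cell = roi[by:by + block, bx:bx + block]
            # Replace every pixel in the cell with the cell's mean color.
            roi[by:by + block, bx:bx + block] = cell.mean(axis=(0, 1))
    frame[y1:y2, x1:x2] = roi.astype(frame.dtype)
    return frame

# Toy 32x32 RGB frame; pixelate the center 16x16 region.
frame = np.arange(32 * 32 * 3, dtype=np.uint8).reshape(32, 32, 3)
out = pixelate_region(frame.copy(), (8, 8, 24, 24), block=8)
```

Segmentation-mask blurring works the same way, except the mask comes from the model per instance instead of a drawn shape.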

python -m privacy_blur.web_app --port 5001 

Runs entirely local. Repo has GIFs comparing all the model approaches side by side on the same 4K frame.

Github link

Curious what text prompts people would want to use for anonymization; the Grounding DINO integration can detect basically anything you can describe.

User preferences differ, though — what would the most common use cases be? And would it help to host this as a website, the way Photopea does? Is there demand for that?

submitted by /u/Honest-Debate-6863