NVILA-HD-Video is a multimodal large language model with 8B parameters that understands and answers questions about videos with up to 4K resolution and 1K frames. Specifically, NVILA-HD-Video uses AutoGaze to remove redundant patches from a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, reducing ViT and LLM latency by up to 19x and 10x, respectively. This enables NVILA-HD-Video to scale efficiently to 4K-resolution, 1K-frame videos, with improved performance on benchmarks such as VideoMME and state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also proposed in this work. This model is for research and development only.
nvidia/NVILA-8B-HD-Video · Hugging Face
Reddit r/LocalLLaMA / 3/12/2026
📰 News · Models & Research
Key Points
- NVILA-HD-Video is an 8B-parameter multimodal LLM capable of understanding and answering questions about videos up to 4K resolution and 1K frames.
- It uses AutoGaze to reduce redundant video patches before running the ViT or LLM, achieving up to 100x token reduction and latency improvements of up to 19x for ViT and 10x for the LLM.
- The model demonstrates improved performance on benchmarks such as VideoMME and achieves state-of-the-art results on the HLVid high-resolution long-form video benchmark.
- The model is released for research and development only and is hosted on Hugging Face by NVIDIA.
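To give a sense of scale for the "up to 100x" token reduction quoted above, here is a rough back-of-the-envelope sketch in Python. It is not taken from the model card: the 16-pixel patch size, the 3840x2160 reading of "4K", and the 1000-frame reading of "1K frames" are all assumptions chosen for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope token budget for a 4K, 1K-frame video.
# Every constant here is an illustrative assumption, not a published
# NVILA-HD-Video or AutoGaze setting.

PATCH_SIZE = 16             # assumed ViT patch edge, in pixels
WIDTH, HEIGHT = 3840, 2160  # assumed meaning of "4K" (UHD)
NUM_FRAMES = 1000           # assumed meaning of "1K frames"
REDUCTION = 100             # upper bound quoted in the summary


def patches_per_frame(width: int, height: int, patch_size: int) -> int:
    """Number of non-overlapping square patches that fit in one frame."""
    return (width // patch_size) * (height // patch_size)


def video_tokens(width: int, height: int, num_frames: int, patch_size: int) -> int:
    """Patch tokens for the whole video with no pruning (one token per patch)."""
    return patches_per_frame(width, height, patch_size) * num_frames


full = video_tokens(WIDTH, HEIGHT, NUM_FRAMES, PATCH_SIZE)
kept = full // REDUCTION

print(f"patches per frame: {patches_per_frame(WIDTH, HEIGHT, PATCH_SIZE):,}")
print(f"tokens without pruning: {full:,}")
print(f"tokens after a {REDUCTION}x reduction: {kept:,}")
```

Under these assumptions the unpruned video comes to roughly 32.4 million patch tokens, and a 100x reduction leaves about 324,000. The real figures depend on the model's actual patch size and any token merging, which the summary does not state.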