NVILA-HD-Video is a multimodal large language model with 8B parameters that understands and answers questions about videos at up to 4K resolution and 1K frames. Specifically, NVILA-HD-Video uses AutoGaze to remove redundant patches from a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, reducing the latency of the ViT and LLM by up to 19x and 10x, respectively. This enables NVILA-HD-Video to scale efficiently to 4K-resolution, 1K-frame videos and to achieve improved performance on benchmarks such as VideoMME, as well as state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also proposed in this work. This model is for research and development only.
nvidia/NVILA-8B-HD-Video · Hugging Face
Reddit r/LocalLLaMA / 3/12/2026
📰 News · Models & Research
Key Points
- NVILA-HD-Video is an 8B-parameter multimodal LLM capable of understanding and answering questions about videos up to 4K resolution and 1K frames.
- It uses AutoGaze to reduce redundant video patches before running the ViT or LLM, achieving up to 100x token reduction and latency improvements of up to 19x for ViT and 10x for the LLM.
- The model demonstrates improved performance on benchmarks such as VideoMME and achieves state-of-the-art results on the HLVid high-resolution long-form video benchmark.
- The model is released for research and development only and is hosted on Hugging Face by Nvidia.
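
To get a feel for the scale behind the "up to 100x" token reduction, here is a back-of-the-envelope sketch. It assumes 3840x2160 frames, exactly 1,000 frames, and a 16-pixel square patch; none of these specifics come from the announcement, and the patch size in particular is a hypothetical value chosen for illustration. The helper name `patch_tokens` is likewise made up for this example.

```python
def patch_tokens(width: int, height: int, frames: int, patch: int = 16) -> int:
    """Count non-overlapping patch tokens in a video.

    Partial patches at the frame edges are ignored. The default patch
    size of 16 is an assumption, not a documented model setting.
    """
    return (width // patch) * (height // patch) * frames


# Assumed input: 4K UHD frames (3840x2160), 1,000 frames.
full = patch_tokens(3840, 2160, 1000)

# Best case quoted in the announcement: up to 100x fewer tokens.
reduced = full // 100

print(f"tokens without reduction: {full:,}")
print(f"tokens at 100x reduction: {reduced:,}")
```

Under these assumptions the unreduced video would produce tens of millions of patch tokens, which suggests why pruning redundant patches before the ViT and LLM matters for inputs of this size. The actual counts depend on the model's real patch size and frame sampling.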