nvidia/NVILA-8B-HD-Video · Hugging Face

Reddit r/LocalLLaMA / 3/12/2026

Key Points

NVILA-HD-Video is an 8B-parameter multimodal LLM capable of understanding and answering questions about videos up to 4K resolution and 1K frames.
It uses AutoGaze to reduce redundant video patches before running the ViT or LLM, achieving up to 100x token reduction and latency improvements of up to 19x for ViT and 10x for the LLM.
The model demonstrates improved performance on benchmarks such as VideoMME and achieves state-of-the-art results on the HLVid high-resolution long-form video benchmark.
The model is released for research and development only and is hosted on Hugging Face by NVIDIA.

NVILA-HD-Video is a Multi-modal Large Language Model with 8B parameters that understands and answers questions about videos with up to 4K resolution and 1K frames.

Specifically, NVILA-HD-Video uses AutoGaze to remove redundant patches from a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, cutting ViT latency by up to 19x and LLM latency by up to 10x. This lets NVILA-HD-Video scale efficiently to 4K-resolution, 1K-frame videos, improve performance on benchmarks such as VideoMME, and achieve state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also proposed in this work.
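To get a feel for the scale involved, here is a back-of-the-envelope sketch that counts ViT patch tokens for a 4K, 1K-frame video and then applies the reported upper-bound 100x reduction. The 3840x2160 frame size, the 1,000-frame count, and the 16-pixel patch size are assumptions for illustration only; the post does not state them, and the function names are hypothetical rather than part of any NVILA or AutoGaze API.

```python
import math


def patch_tokens(width: int, height: int, frames: int, patch: int = 16) -> int:
    """Count non-overlapping patch tokens, rounding partial patches up.

    The patch size of 16 is an assumed value, not taken from the model card.
    """
    per_frame = math.ceil(width / patch) * math.ceil(height / patch)
    return per_frame * frames


def after_reduction(tokens: int, factor: int = 100) -> int:
    """Tokens left after an idealized `factor`-fold reduction (best case)."""
    return math.ceil(tokens / factor)


# Assumed input: 3840x2160 frames, 1,000 frames.
full = patch_tokens(3840, 2160, 1000)
kept = after_reduction(full)
print(full, kept)  # 32400000 324000
```

Under these assumptions the unreduced video would produce about 32 million patch tokens, and a best-case 100x reduction would leave roughly 324 thousand. The 100x figure is reported as an upper bound, so actual savings depend on how redundant a given video is.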

This model is for research and development only.

submitted by /u/jacek2023
