V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

arXiv cs.CV / 4/15/2026


Key Points

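  • The authors argue that existing nutrition estimation methods, which rely on a single image of the finished dish, are limited because oils, sauces, and mixed components become visually ambiguous after cooking.
  • The paper tests whether information from egocentric (first-person) cooking videos, that is, the cooking process itself, can contribute to dish-level calorie and macronutrient estimation.
  • The authors report additional manual annotation of the HD-EPIC dataset and describe the result as the first benchmark for video-based nutrition estimation.
  • The proposed method, V-Nutri, combines visual backbones pretrained on Nutrition5K with a lightweight fusion module that integrates keyframes from the cooking process in addition to the final frame.
  • V-Nutri also incorporates VideoMamba-based event detection (targeting ingredient-addition moments). The results show that process keyframes can help in some cases, but that the benefit depends heavily on backbone capacity and event detection quality.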
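
To make the keyframe idea concrete, here is a minimal sketch of how per-frame event scores could be turned into a small set of process keyframes. It assumes the event detector emits one ingredient-addition score per frame; the function name `select_keyframes`, the threshold, the minimum gap, and the greedy peak-picking rule are all hypothetical and are not taken from the paper.

```python
def select_keyframes(scores, threshold=0.5, min_gap=30, max_frames=8):
    """Pick frame indices whose event score is high and well separated.

    scores:     per-frame ingredient-addition scores (assumed in [0, 1])
    threshold:  minimum score for a frame to be considered (assumption)
    min_gap:    minimum distance in frames between kept keyframes (assumption)
    max_frames: upper bound on the number of keyframes returned (assumption)
    """
    # Keep only frames the detector is reasonably confident about.
    candidates = [i for i, s in enumerate(scores) if s >= threshold]
    # Visit the most confident frames first (greedy suppression of neighbours).
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    for i in candidates:
        if len(kept) == max_frames:
            break
        # Skip frames too close to an already selected keyframe.
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    # Return keyframes in temporal order.
    return sorted(kept)


if __name__ == "__main__":
    demo_scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.95, 0.3]
    print(select_keyframes(demo_scores, threshold=0.5, min_gap=3))
```

A rule like this makes the dependence on event detection quality easy to see: noisy scores either flood the selection with irrelevant frames or drop the moments where oil or sauce is actually added.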

Abstract

Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the finished dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframe selection module: a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.
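
As a rough illustration of the fusion step described in the abstract, the sketch below combines a final-frame feature vector with a variable number of process-keyframe feature vectors and maps the result to dish-level targets. The abstract only states that the module is lightweight and aggregates both kinds of features; the attention-style pooling, the concatenation, the linear head, and the names `fuse` and `predict_nutrition` are assumptions made for this example. It is written in dependency-free Python for readability rather than as a trainable model.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(final_feat, keyframe_feats):
    """Concatenate the final-frame feature with an attention-pooled keyframe feature.

    The final frame acts as the query (an assumption): keyframes that are
    more similar to it receive a larger pooling weight.
    """
    dim = len(final_feat)
    if not keyframe_feats:
        # No process evidence: fall back to a zero vector for the pooled part.
        pooled = [0.0] * dim
    else:
        logits = [dot(final_feat, k) / math.sqrt(dim) for k in keyframe_feats]
        weights = softmax(logits)
        pooled = [
            sum(w * k[i] for w, k in zip(weights, keyframe_feats))
            for i in range(dim)
        ]
    return list(final_feat) + pooled


def predict_nutrition(fused, weight_rows, biases):
    """Linear head: one output per target (e.g. calories, fat, carbs, protein)."""
    return [dot(row, fused) + b for row, b in zip(weight_rows, biases)]


if __name__ == "__main__":
    final_frame = [1.0, 0.0]
    keyframes = [[2.0, 3.0], [0.5, 1.0]]
    fused = fuse(final_frame, keyframes)
    # Toy weights for two outputs; real weights would be learned.
    print(predict_nutrition(fused, [[1, 0, 0, 0], [0, 0, 1, 1]], [0.5, 0.0]))
```

Falling back to a zero vector when no keyframes are available keeps the input size of the head fixed, so the same head can serve both the image-only and the video-augmented settings.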