OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

arXiv cs.CV / 4/15/2026

Key Points

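  • OmniFood8K targets two gaps in prior work: limited coverage of Chinese dishes and dependence on depth sensors. It contributes a multimodal food dataset of 8,036 food samples, each with detailed nutritional annotations and multi-view images.
  • To estimate nutrition from a single RGB image, the method first predicts a depth map from the RGB input. It then refines that map with SSRA (Scale-Shift Residual Adapter) to enforce global scale consistency while preserving local structure.
  • FAFM (Frequency-Aligned Fusion Module) hierarchically aligns and fuses RGB and depth features in the frequency domain to improve prediction accuracy.
  • MPH (Mask-based Prediction Head) emphasizes key ingredient regions through dynamic channel selection, making the nutrition estimates more accurate.
  • The authors also build NutritionSynth-115K, a synthetic dataset that introduces compositional variation while keeping nutritional labels exact, and report that their method outperforms existing approaches on multiple datasets.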
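Neither the key points nor the abstract give SSRA's equations. As a rough illustration of what "global scale consistency" means for a predicted depth map, the sketch below uses a standard, swapped-in technique that is commonly applied to relative monocular depth: a closed-form least-squares fit of one scale `s` and one shift `t` that maps the prediction onto reference depth, plus an optional per-pixel residual standing in for local refinement. This is not the authors' SSRA, which is presumably learned and has to work without ground-truth depth at inference time; the function names `fit_scale_shift` and `align_depth` and the `residual` argument are hypothetical.

```python
import numpy as np


def fit_scale_shift(pred, target, mask=None):
    """Closed-form least-squares fit of a global scale s and shift t so that
    s * pred + t approximates target over the valid pixels in mask.
    (Illustrative stand-in, not the paper's SSRA.)"""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    p = pred[mask]
    g = target[mask]
    if p.size < 2:
        raise ValueError("need at least two valid pixels to fit scale and shift")
    # Each valid pixel contributes one row [p_i, 1]; solve min ||A @ [s, t] - g||^2.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return float(s), float(t)


def align_depth(pred, target, mask=None, residual=None):
    """Apply the fitted global scale/shift, then an optional per-pixel residual.

    `residual` is a hypothetical stand-in for a local refinement term; here it
    is simply added, whereas a learned adapter would predict it from features."""
    s, t = fit_scale_shift(pred, target, mask)
    aligned = s * np.asarray(pred, dtype=float) + t
    if residual is not None:
        aligned = aligned + np.asarray(residual, dtype=float)
    return aligned
```

Because `s` and `t` are shared by every pixel, the fit corrects global scale and offset but cannot change the relative shape of the depth map; any local correction has to come from the residual term. In a learned adapter, these quantities would presumably be predicted from image features rather than fitted against ground truth.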

Abstract

Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets focus on Western cuisines and lack sufficient coverage of Chinese dishes, which limits accurate nutritional estimation for Chinese meals. In addition, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in everyday scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. To further strengthen models' nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. We also propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from the RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) that emphasizes key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/
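The abstract also leaves FAFM and MPH unspecified. The sketch below illustrates two generic ideas behind them under stated assumptions: `frequency_fuse` splits two same-shaped (C, H, W) feature maps into low- and high-frequency bands with a radial mask on their 2-D FFTs and mixes each band with its own fixed weight, and `masked_channel_select` scores channels by masked average pooling over an ingredient mask and keeps the top-k channels. The fixed band weights, the radial cutoff, top-k gating, and all function names are hypothetical choices for illustration, not the paper's modules; in FAFM the alignment and fusion are presumably learned.

```python
import numpy as np


def radial_lowpass_mask(h, w, cutoff):
    """True for frequencies within `cutoff` (fraction of Nyquist) of DC."""
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return radius <= cutoff * 0.5


def frequency_fuse(rgb_feat, depth_feat, cutoff=0.25, w_low=0.5, w_high=0.5):
    """Fuse two (C, H, W) feature maps band by band in the 2-D Fourier domain.
    Hypothetical fixed-weight fusion, not the paper's FAFM."""
    if rgb_feat.shape != depth_feat.shape:
        raise ValueError("feature maps must have the same shape")
    _, h, w = rgb_feat.shape
    f_rgb = np.fft.fft2(rgb_feat, axes=(-2, -1))
    f_dep = np.fft.fft2(depth_feat, axes=(-2, -1))
    low = radial_lowpass_mask(h, w, cutoff)  # (H, W), broadcast over channels
    # Low band and high band each get their own RGB/depth mixing weight.
    fused = np.where(low,
                     w_low * f_rgb + (1.0 - w_low) * f_dep,
                     w_high * f_rgb + (1.0 - w_high) * f_dep)
    # The mask is symmetric in frequency, so the result is real up to rounding.
    return np.fft.ifft2(fused, axes=(-2, -1)).real


def masked_channel_select(feat, mask, k):
    """Keep the k channels with the highest mean activation inside `mask`.
    Hypothetical top-k gating standing in for "dynamic channel selection"."""
    k = min(k, feat.shape[0])
    weights = mask / (mask.sum() + 1e-8)           # (H, W), normalized mask
    scores = (feat * weights).sum(axis=(-2, -1))   # (C,) masked average pooling
    keep = np.argsort(scores)[::-1][:k]            # k highest-scoring channels
    gate = np.zeros(feat.shape[0])
    gate[keep] = 1.0
    return feat * gate[:, None, None], np.sort(keep)
```

To mimic the hierarchical aspect, one could call `frequency_fuse` on each level of matching RGB and depth feature pyramids, for example `[frequency_fuse(r, d) for r, d in zip(rgb_levels, depth_levels)]`, and then apply `masked_channel_select` to the fused map using a predicted ingredient mask.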