LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

arXiv cs.AI / 3/23/2026

💬 Opinion · Models & Research

Key Points

  • LoFi introduces a location-aware fine-grained representation learning framework for chest X-rays that uses region-level supervision via a location-aware captioning loss to improve grounding and dense captioning.
  • The approach jointly optimizes sigmoid, captioning, and location-aware captioning losses with a lightweight large language model to learn fine-grained, region-specific representations (a loss sketch follows this list).
  • A fine-grained encoder is integrated into retrieval-based in-context learning to enhance chest X-ray grounding across diverse clinical settings.
  • Experiments on MIMIC-CXR and PadChest-GR demonstrate superior retrieval and phrase grounding performance, highlighting practical improvements in fine-grained medical image understanding.
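
To make the joint objective concrete, the following is a minimal PyTorch sketch of how the three losses might be combined. The function names, the box-regression term inside the location-aware captioning loss, and the loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of LoFi's joint objective: a SigLIP-style sigmoid
# image-text loss combined with captioning and location-aware captioning
# losses. Shapes, names, and weights are assumptions for illustration.
import torch
import torch.nn.functional as F


def sigmoid_contrastive_loss(img_emb, txt_emb, temperature=0.07, bias=0.0):
    """Pairwise sigmoid loss over all image-text pairs in the batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature + bias                # (B, B)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1   # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()


def captioning_loss(lm_logits, caption_ids, pad_id=0):
    """Token-level cross-entropy for the lightweight LLM captioner
    (assumes logits are already shifted to align with target tokens)."""
    return F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        caption_ids.reshape(-1),
        ignore_index=pad_id,
    )


def location_aware_captioning_loss(lm_logits, grounded_ids, box_preds, boxes, pad_id=0):
    """Region-level supervision: dense-caption token loss plus an assumed L1 box term."""
    text_term = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        grounded_ids.reshape(-1),
        ignore_index=pad_id,
    )
    box_term = F.l1_loss(box_preds, boxes)
    return text_term + box_term


def lofi_loss(outs, batch, w_sig=1.0, w_cap=1.0, w_loc=1.0):
    """Weighted sum of the three objectives; weights are illustrative."""
    return (
        w_sig * sigmoid_contrastive_loss(outs["img_emb"], outs["txt_emb"])
        + w_cap * captioning_loss(outs["cap_logits"], batch["caption_ids"])
        + w_loc * location_aware_captioning_loss(
            outs["ground_logits"], batch["grounded_ids"],
            outs["box_preds"], batch["boxes"],
        )
    )
```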

Abstract

Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision-language models to capture fine-grained representations under external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
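
The retrieval-based in-context learning step can be pictured as a nearest-neighbor lookup over embeddings from the fine-grained encoder, with the retrieved grounded reports used as in-context demonstrations. The sketch below is a plausible reading of that pipeline; the function names, record fields, and prompt format are assumptions, not the paper's exact procedure.

```python
# Hypothetical retrieval-based in-context learning with a fine-grained
# encoder: embed the query X-ray, retrieve the most similar indexed
# studies, and format their (phrase, box) pairs as demonstrations.
import torch
import torch.nn.functional as F


def retrieve_in_context_examples(query_img, encoder, index_embs, index_records, k=3):
    """Return the k indexed records whose embeddings are closest to the query image."""
    with torch.no_grad():
        q = F.normalize(encoder(query_img.unsqueeze(0)), dim=-1)   # (1, D)
        sims = q @ F.normalize(index_embs, dim=-1).t()             # (1, N)
        top_idx = sims.topk(k, dim=-1).indices.squeeze(0).tolist()
    return [index_records[i] for i in top_idx]


def build_grounding_prompt(examples, query_phrase):
    """Format retrieved phrase-box pairs as in-context demonstrations for grounding."""
    lines = [f"Phrase: {ex['phrase']} -> Box: {ex['box']}" for ex in examples]
    lines.append(f"Phrase: {query_phrase} -> Box:")
    return "\n".join(lines)
```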