Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
arXiv cs.CV / 3/20/2026
Key Points
- The authors introduce a training-only heterogeneous image-patch-text graph teacher that captures cross-modal relations among multi-scale visual patches and text prompts during training.
- The teacher uses a Modality-aware Graph Transformer to perform deep cross-modal reasoning and applies discriminative node filtering to extract high-fidelity class features.
- They employ a cache-aware dual-objective strategy to distill the teacher's relational knowledge into the Tip-Adapter's key-value cache, upgrading its prototypes; the graph teacher is discarded at test time, adding no inference cost.
- Experiments on standard 1-16-shot benchmarks report state-of-the-art performance, and ablations confirm the importance of auxiliary graph supervision, text-guided reasoning, and node filtering.
- Code is available at https://github.com/MR-Sherif/TOGA.git.
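The summary above does not give the paper's formulas, but the Tip-Adapter cache it builds on is well documented. As a rough sketch under that assumption: few-shot image features are stored as cache *keys*, their one-hot labels as cache *values*, and a test feature is classified by blending zero-shot text similarity with similarity-weighted cache lookups. The function name, shapes, and default `alpha`/`beta` values below are illustrative, not taken from this paper.

```python
import numpy as np

def tip_adapter_logits(f, keys, values, text_w, alpha=1.0, beta=5.5):
    """Tip-Adapter-style cache classification (simplified sketch).

    f:      (d,)   L2-normalized test image feature
    keys:   (N, d) L2-normalized cached few-shot features (cache keys)
    values: (N, C) one-hot labels of the cached shots (cache values)
    text_w: (C, d) L2-normalized class text embeddings (zero-shot head)
    """
    # Affinity between the test feature and each cached shot,
    # sharpened by beta (as in the original Tip-Adapter formulation).
    affinity = np.exp(-beta * (1.0 - f @ keys.T))   # (N,)
    cache_logits = affinity @ values                # (C,) few-shot evidence
    clip_logits = 100.0 * (f @ text_w.T)            # (C,) zero-shot evidence
    return clip_logits + alpha * cache_logits
```

In this framing, the paper's contribution is to supervise the cache keys during training so they absorb the graph teacher's cross-modal relational structure; since only the (improved) keys and values are kept, inference remains exactly the lookup above.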