Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
arXiv cs.CV / 3/20/2026
Key Points
- The authors introduce a training-only heterogeneous image-patch-text graph teacher that captures cross-modal relations among multi-scale visual patches and text prompts during training.
- The teacher uses a Modality-aware Graph Transformer to perform deep cross-modal reasoning and applies discriminative node filtering to extract high-fidelity class features.
- They employ a cache-aware dual-objective strategy to distill the teacher's relational knowledge into the Tip-Adapter's key-value cache, upgrading the cached prototypes; the graph teacher is discarded at test time, adding no inference cost.
- Experiments on standard 1- to 16-shot benchmarks report state-of-the-art performance, and ablations confirm the contributions of auxiliary graph supervision, text-guided reasoning, and node filtering.
- Code is available at https://github.com/MR-Sherif/TOGA.git.
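To make the key points concrete, here is a minimal NumPy sketch of the two mechanisms the summary names: the standard Tip-Adapter key-value cache inference rule the method builds on, and a training-only dual-objective loss that aligns the student with a graph teacher. This is not the authors' code; the `alpha`/`beta` defaults follow the original Tip-Adapter paper's convention, and the MSE distillation term is an illustrative stand-in for the unspecified cache-aware objective.

```python
import numpy as np

def tip_adapter_logits(query, cache_keys, cache_values, clip_weights,
                       alpha=1.0, beta=5.5):
    """Tip-Adapter inference: blend cache affinity with zero-shot CLIP logits.

    query:        (d,)   L2-normalized image feature
    cache_keys:   (N, d) few-shot support features (the cache "keys")
    cache_values: (N, C) one-hot labels (the cache "values")
    clip_weights: (d, C) text-classifier weights
    """
    affinity = query @ cache_keys.T                        # cosine similarity
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values
    zero_shot = query @ clip_weights
    return zero_shot + alpha * cache_logits

def dual_objective_loss(student_logits, labels_onehot, teacher_logits, lam=1.0):
    """Training-only dual objective: cross-entropy on ground truth plus an
    alignment term to the graph teacher's logits (MSE here as a placeholder).
    The teacher is dropped at test time, so inference cost is unchanged."""
    shifted = student_logits - student_logits.max(-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(-1, keepdims=True)
    ce = -np.sum(labels_onehot * np.log(probs + 1e-9), axis=-1).mean()
    distill = np.mean((student_logits - teacher_logits) ** 2)
    return ce + lam * distill
```

In this framing, the paper's contribution amounts to supervising `cache_keys`/`cache_values` with the teacher during training, so the upgraded prototypes carry relational knowledge while `tip_adapter_logits` stays unchanged at test time.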