AI Navigate

[P] Finetuned a small LM into a VLM with adapters locally and wrote a short article about it

Reddit r/MachineLearning / 3/20/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author finetuned a standard 135M parameter text language model to add vision capabilities using vision-language model adapters, demonstrating a practical small-model approach.
  • The Towards Data Science article documents each stage, including how Q-Formers work and how adapters between LMs and VLMs are trained, along with datasets used.
  • The GitHub repository for the project has been open-sourced, enabling others to reproduce or extend the workflow.
  • The post serves as a learning resource for others pursuing similar VLM-from-scratch projects by sharing notes and lessons learned.

Recently I worked on a VLM training project that took a standard 135M-parameter text language model and gave it vision capabilities. I wrote an article on Towards Data Science covering each stage of the project, what I learned, and so on.

The article contains all my notes on how Q-Formers work, how adapters between LMs and vision models are trained, the datasets used, etc. The Git repo is also open-sourced.
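For readers new to the idea, a Q-Former-style adapter can be sketched as a fixed set of learned query vectors that cross-attend over image patch features and project the pooled result into the LM's embedding space. Below is a minimal NumPy sketch of that shape flow; the class name, dimensions, and single-head attention are illustrative assumptions, not the author's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class QFormerAdapter:
    """Toy Q-Former-style adapter (hypothetical, for illustration):
    learned queries cross-attend over image patch features, then a
    linear projection maps them into the LM's embedding space.
    Forward pass only; no training loop."""
    def __init__(self, n_queries=8, d_vision=32, d_lm=64, seed=0):
        rng = np.random.default_rng(seed)
        self.queries = rng.normal(size=(n_queries, d_vision)) * 0.02  # learned query tokens
        self.w_proj = rng.normal(size=(d_vision, d_lm)) * 0.02        # vision -> LM projection

    def __call__(self, patch_feats):
        # patch_feats: (n_patches, d_vision) from a frozen vision encoder
        scores = self.queries @ patch_feats.T / np.sqrt(patch_feats.shape[1])
        attn = softmax(scores, axis=-1)   # (n_queries, n_patches)
        pooled = attn @ patch_feats       # (n_queries, d_vision)
        return pooled @ self.w_proj       # (n_queries, d_lm): tokens fed to the LM

adapter = QFormerAdapter()
# e.g. 196 patches from a 14x14 grid, each a 32-dim feature
img_tokens = adapter(np.random.default_rng(1).normal(size=(196, 32)))
print(img_tokens.shape)  # (8, 64)
```

The key property is that the LM sees a fixed number of "image tokens" (here 8) regardless of how many patches the vision encoder emits, which keeps the text model's input length bounded.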

Sharing in case someone doing a similar project finds it useful as a learning resource.

https://towardsdatascience.com/how-vision-language-models-are-trained-from-scratch/

submitted by /u/AvvYaa