I recently worked on a VLM training project that took a standard 135M-parameter text-only language model and gave it vision capabilities. I wrote an article on Towards Data Science covering each stage of the project and what I learned along the way.
The article contains all my notes on how Q-Formers work, how the adapters that connect the vision model to the LM are trained, the datasets involved, etc. The Git repo is also open-sourced.
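For a rough idea of what a Q-Former-style adapter looks like, here is a minimal sketch (my own illustration, assuming a BLIP-2-style design, not code from the article or repo): learned query tokens cross-attend to frozen vision features, and the result is projected into the LM's embedding dimension. All dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    """Sketch of a Q-Former-style adapter: learned queries pull information
    out of frozen vision features via cross-attention, then a linear layer
    maps the result into the LM's embedding space."""

    def __init__(self, vision_dim=768, lm_dim=576, num_queries=32, num_heads=8):
        super().__init__()
        # Learned query tokens (trained; the vision encoder stays frozen)
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, lm_dim)  # into LM embedding space

    def forward(self, vision_feats):  # vision_feats: (batch, num_patches, vision_dim)
        b = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.cross_attn(q, vision_feats, vision_feats)
        return self.proj(out)  # (batch, num_queries, lm_dim)

adapter = QFormerAdapter()
feats = torch.randn(2, 196, 768)  # e.g. ViT patch features for 2 images
tokens = adapter(feats)
print(tokens.shape)  # torch.Size([2, 32, 576])
```

The resulting `num_queries` soft tokens are prepended to the text embeddings fed into the LM; only the adapter (and optionally the LM) is updated during training.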
Sharing in case anyone is working on a similar project and finds it useful as a learning resource.
https://towardsdatascience.com/how-vision-language-models-are-trained-from-scratch/