| So, here's an update on my GRPO training for length-constrained Reddit post summarization on 3x Mac Minis - a new direction!
So, once all the t-tests and evals were done for the LFM2.5-350M and Qwen2.5-0.5B-Instruct models with the length penalty and quality metrics (given below), I looked at the quality-metric results and saw that BLEU and ROUGE-L were particularly low when training from scratch.
Well, I had a faint idea of how to circumvent this: what if I used an already fine-tuned version that outputs exactly 64 tokens? But the idea was like a flash - zoooom, and poof, gone! Then a Redditor pointed it out and I went "hmm, well, I already have a checkpoint trained with only the length penalty added!" Now, I could have just SFT'ed the model to output exactly the required number of tokens, as some of you may be thinking - and yes, that's the next experiment, along with a DPO comparison! So, currently, I have been training LFM2.5-350M and Qwen2.5-0.5B-Instruct for the same!
[link] [comments] |
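The post mentions a length penalty combined with quality metrics but does not give the exact reward formula. A minimal sketch of what such a GRPO reward might look like, assuming a 64-token target, a linear deviation penalty, and an invented `weight` hyperparameter (all illustrative assumptions, not the author's actual code):

```python
# Hypothetical length-penalty reward for GRPO-style training.
# The 64-token target comes from the post; the linear penalty shape
# and the `weight` hyperparameter are assumptions for illustration.

def length_penalty_reward(completion_tokens: int,
                          quality_score: float,
                          target_len: int = 64,
                          weight: float = 1.0) -> float:
    """Combine a text-quality score with a penalty for missing target_len.

    The penalty grows linearly with the relative deviation from the
    target length, so a summary at exactly 64 tokens keeps its full
    quality score.
    """
    deviation = abs(completion_tokens - target_len) / target_len
    return quality_score - weight * deviation
```

Under this shape, a 64-token summary with quality 1.0 scores 1.0, while a 96-token one loses half a point at `weight=1.0` - which illustrates the interaction the author suspects: a strong length penalty can dominate the reward and drag down overlap metrics like BLEU and ROUGE-L.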
Trying to train tiny LLMs on a length-constrained Reddit post summarization task using GRPO on 3x Mac Minis - updates!
Reddit r/LocalLLaMA / 5/5/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The author reports ongoing experiments training small LLMs for a Reddit post summarization task with a strict 64-token output constraint using GRPO, after seeing poor BLEU and ROUGE-L results when training from scratch with a length penalty.
- They hypothesize the length-penalty design is causing low text-quality scores due to interactions like brevity penalties, and consider switching to models already fine-tuned to produce exactly 64 tokens.
- The current work continues training LFM2.5-350M and Qwen2.5-0.5B-Instruct, with a planned comparison to SFT (and DPO) approaches that better control token length.
- Evaluation is done with an “LLM-as-a-Judge” setup using GPT-5 via DeepEval, scoring summaries on faithfulness, coverage, conciseness, and clarity rather than relying solely on overlap metrics.
- Training runs on a 3-node cluster of Mac Minis using MLX, where one node drives GRPO and two nodes perform rollouts via the vLLM-metal framework, orchestrated through smolcluster with a synchronous parameter server (SyncPS) architecture.
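The synchronous parameter-server (SyncPS) layout described above - one trainer node driving GRPO, two nodes doing rollouts - can be sketched as a single training round. The class and function names here are invented for illustration; the real setup uses smolcluster, MLX, and vLLM-metal, whose actual APIs are not shown:

```python
# Minimal sketch of one SyncPS round, assuming one trainer and N rollout
# workers. Worker/trainer interfaces are hypothetical stand-ins for the
# smolcluster orchestration the post describes.

from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float


def sync_round(params: dict, workers, train_step) -> dict:
    """One synchronous round: broadcast params, gather rollouts, update.

    workers    : objects exposing load_params() and generate_rollouts()
    train_step : callable (params, rollouts) -> new params (the GRPO update)
    """
    # 1. Broadcast the current policy weights to every rollout worker.
    for w in workers:
        w.load_params(params)
    # 2. Synchronous barrier: wait for every worker's rollouts before
    #    updating, so all samples come from the same policy version.
    rollouts = [r for w in workers for r in w.generate_rollouts()]
    # 3. The trainer node performs one GRPO update on the gathered batch.
    return train_step(params, rollouts)
```

The synchronous barrier in step 2 is what distinguishes this design from async schemes: rollouts are never generated from stale weights, at the cost of the trainer idling until the slowest worker finishes.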