| So, here's an update on my GRPO training for length-constrained Reddit post summarization on 3x Mac Minis - a new direction!
So, once all the t-tests and evals were done for the LFM2.5-350M and Qwen2.5-0.5B-Instruct models with the length penalty and quality metrics (given below), I looked at the quality-metric results and saw that BLEU and ROUGE-L were particularly low when training from scratch.
Well, I had a faint idea of how to circumvent this: what if I used an already fine-tuned version that outputs exactly 64 tokens? But the idea was like a flash - zoooom, and poof, gone! Then a Redditor pointed it out and I went "hmm, well, I already have a checkpoint trained with only the length penalty added!" Now, I could have just SFT'ed the model to output exactly the required number of tokens, as some of you may be thinking - and yes, that's the next experiment, along with a DPO comparison! So, currently, I have been training LFM2.5-350M and Qwen2.5-0.5B-Instruct for the same!
[link] [comments] |
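The post mentions a length penalty combined with quality metrics but does not give the exact reward formula. A minimal sketch of what such a GRPO reward might look like, assuming a 64-token target, a linear deviation penalty, and an invented `weight` hyperparameter (all illustrative assumptions, not the author's actual code):

```python
# Hypothetical length-penalty reward for GRPO-style training.
# The 64-token target comes from the post; the linear penalty shape
# and the `weight` hyperparameter are assumptions for illustration.

def length_penalty_reward(completion_tokens: int,
                          quality_score: float,
                          target_len: int = 64,
                          weight: float = 1.0) -> float:
    """Combine a text-quality score with a penalty for missing target_len.

    The penalty grows linearly with the relative deviation from the
    target length, so a summary at exactly 64 tokens keeps its full
    quality score.
    """
    deviation = abs(completion_tokens - target_len) / target_len
    return quality_score - weight * deviation
```

Under this shape, a 64-token summary with quality 1.0 scores 1.0, while a 96-token one loses half a point at `weight=1.0` - which illustrates the interaction the author suspects: a strong length penalty can dominate the reward and drag down overlap metrics like BLEU and ROUGE-L.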
Trying to train tiny LLMs on a length-constrained Reddit post summarization task using GRPO on 3x Mac Minis - updates!
Reddit r/LocalLLaMA / 5/5/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The author reports ongoing experiments training small LLMs for a Reddit post summarization task with a strict 64-token output constraint using GRPO, after seeing poor BLEU and ROUGE-L results when training from scratch with a length penalty.
- They hypothesize the length-penalty design is causing low text-quality scores due to interactions like brevity penalties, and consider switching to models already fine-tuned to produce exactly 64 tokens.
- The current work continues training LFM2.5-350M and Qwen2.5-0.5B-Instruct, with a planned comparison to SFT (and DPO) approaches that better control token length.
- Evaluation is done with an “LLM-as-a-Judge” setup using GPT-5 via DeepEval, scoring summaries on faithfulness, coverage, conciseness, and clarity rather than relying solely on overlap metrics.
- Training runs on a 3-node cluster of Mac Minis using MLX, where one node drives GRPO and two nodes perform rollouts via the vLLM-metal framework, orchestrated through smolcluster with a synchronous parameter server (SyncPS) architecture.
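The synchronous parameter-server (SyncPS) layout described above - one trainer node driving GRPO, two nodes doing rollouts - can be sketched as a single training round. The class and function names here are invented for illustration; the real setup uses smolcluster, MLX, and vLLM-metal, whose actual APIs are not shown:

```python
# Minimal sketch of one SyncPS round, assuming one trainer and N rollout
# workers. Worker/trainer interfaces are hypothetical stand-ins for the
# smolcluster orchestration the post describes.

from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float


def sync_round(params: dict, workers, train_step) -> dict:
    """One synchronous round: broadcast params, gather rollouts, update.

    workers    : objects exposing load_params() and generate_rollouts()
    train_step : callable (params, rollouts) -> new params (the GRPO update)
    """
    # 1. Broadcast the current policy weights to every rollout worker.
    for w in workers:
        w.load_params(params)
    # 2. Synchronous barrier: wait for every worker's rollouts before
    #    updating, so all samples come from the same policy version.
    rollouts = [r for w in workers for r in w.generate_rollouts()]
    # 3. The trainer node performs one GRPO update on the gathered batch.
    return train_step(params, rollouts)
```

The synchronous barrier in step 2 is what distinguishes this design from async schemes: rollouts are never generated from stale weights, at the cost of the trainer idling until the slowest worker finishes.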