Training LFM-2.5-350M on Reddit post summarization with GRPO on my 3x Mac Minis — final evals and t-test evals are here [P]

Reddit r/MachineLearning / 4/25/2026



Key Points

The post reports experiments training a small LLM (“LFM-2.5-350M”) for Reddit post summarization with strict length constraints (about 64 tokens) using GRPO, to test whether tiny models can produce high-quality concise summaries under tight output limits.
Two reward setups were compared: one using only a length penalty, and another combining length penalty with a quality reward derived from metrics like ROUGE-L/METEOR (and also BLEU in other variants).
LLM-as-a-judge evaluation using DeepEval metrics (Conciseness, Coverage, Clarity, and Faithfulness) is used to compare variants, with the best-performing configuration reaching a composite score around 2.769/4 versus 2.23/4 for length penalty alone.
Additional results include t-test-based ranking across multiple reward configurations, showing that incorporating quality rewards (not just length) improves composite scores and faithfulness-related measures, though pass rates vary across reward types.
The experiments were run on the author’s hardware (three Mac Minis), and the results are shared as “final evals” and “t-test evals,” providing a practical reference for GRPO reward design in constrained summarization tasks.


So, with this project I want to see whether tiny LLMs trained with GRPO can produce quality summaries under a tight length constraint (about 64 tokens)!

https://preview.redd.it/zynqkm0osaxg1.png?width=2816&format=png&auto=webp&s=7790bcdb17ddf57cd5e9c1037885127b6d5452e5

So, I trained two variants of this task:

using only a length penalty
using a length penalty plus either a single quality reward or a combination of quality rewards

I ran an LLM-as-a-judge eval with DeepEval to check summarization quality. The metrics are:

Conciseness
Coverage
Clarity
Faithfulness

The results are attached, and the final ones are as follows:

with quality (ROUGE-L + METEOR) + length penalty rewards: 2.7/4 (wins again!)
with just length penalty: 2.23/4

Ranking from the t-test evals across the reward configurations:

Summary Table

Reward Configuration	Composite	Faithfulness	Coverage	Conciseness	Clarity	Pass Rate
`length-quality-meteor-rouge` ⭐	2.769	0.832	0.511	0.659	0.767	44.3%
`length-quality-bleu-rouge`	2.732	0.810	0.502	0.650	0.770	39.1%
`length-quality-meteor-bleu`	2.664	0.792	0.468	0.648	0.756	38.3%
`length-quality-rouge-l`	2.555	0.725	0.415	0.637	0.778	32.4%
`length-quality-meteor`	2.484	0.721	0.427	0.625	0.711	—
`length-quality-bleu`	2.400	0.680	0.399	0.577	0.744	26.9%
`length-only` (baseline)	2.416	0.678	0.407	0.592	0.739	30.7%

Evaluated on 200 test samples from the smoltldr dataset. Baseline: length penalty only.
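For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a paired t-statistic between one reward configuration and the baseline. It assumes both were scored on the same test samples. The post does not say which t-test variant was used, so the paired form, the function name `paired_t_statistic`, and the toy numbers are all assumptions for illustration, not the author's results.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic for per-sample scores of two configurations.

    Assumes scores_a[i] and scores_b[i] were measured on the same sample.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same samples")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired samples")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n))


# Toy per-sample composite scores (made up for illustration only).
candidate = [2.9, 2.5, 3.1, 2.6]
baseline = [2.4, 2.3, 2.6, 2.5]
print(f"t = {paired_t_statistic(candidate, baseline):.3f}")
```

A larger positive t means the candidate configuration beats the baseline more consistently across samples. Turning t into a p-value needs the Student t distribution with n - 1 degrees of freedom, which the standard library does not provide.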

All the code and wandb charts in the comments!

Setup: 3x Mac Minis in a cluster running MLX.

One node drives training with GRPO, and the other two generate rollouts via the vLLM-metal framework. All of the work was done using smolcluster.com.

I used the SyncPS architecture, a synchronous parameter server setup: the master node runs training, and the worker nodes run vLLM.
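GRPO scores each rollout relative to the other rollouts sampled for the same prompt, so the training node needs a full group of rewards before it can update. A minimal sketch of that group-relative advantage step is below. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration (some implementations use the sample standard deviation), and this is not the project's actual training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts.

    Each advantage is (reward - group mean) / (group std + eps).
    eps avoids division by zero when every rollout gets the same reward.
    """
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean_r) / (std_r + eps) for r in rewards]


# Toy rewards for four rollouts of the same prompt (made up for illustration).
print(group_relative_advantages([-3.0, 0.0, -10.0, -1.0]))
```

Rollouts that score above the group average get a positive advantage and are reinforced, while those below average are pushed down. A group where every rollout gets the same reward yields all-zero advantages and no learning signal.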

Eval:

LLM-as-a-Judge (gpt-5)

Used DeepEval to build a judge pipeline scoring each summary on 4 axes:

Faithfulness: no hallucinations vs. source
Coverage: key points captured
Conciseness: shorter, no redundancy
Clarity: readable on its own

The composite score is the sum of the four axis scores (each between 0 and 1), so the maximum is 4.
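Each row of the summary table adds up across its four axes to the reported composite (for example, 0.832 + 0.511 + 0.659 + 0.767 = 2.769). The sketch below therefore treats the composite as that sum out of 4. The function name and dict layout are assumptions for illustration.

```python
AXES = ("faithfulness", "coverage", "conciseness", "clarity")


def composite_score(scores):
    """Sum the four judge scores (each in [0, 1]) into a composite out of 4."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise KeyError(f"missing axis scores: {missing}")
    return sum(scores[axis] for axis in AXES)


# Axis values copied from the `length-quality-meteor-rouge` row of the table.
row = {"faithfulness": 0.832, "coverage": 0.511, "conciseness": 0.659, "clarity": 0.767}
print(round(composite_score(row), 3))  # 2.769
```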

Reward system

length_penalty : basically, -abs(response_length - MAX_LENGTH)
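The formula above can be sketched as a small function. `MAX_LENGTH = 64` comes from the token budget mentioned in the post. Measuring `response_length` in tokens and the function signature are assumptions.

```python
MAX_LENGTH = 64  # target summary length in tokens, per the post


def length_penalty(response_length, max_length=MAX_LENGTH):
    """Return -|response_length - max_length|; 0.0 is the best possible score."""
    return float(-abs(response_length - max_length))


print(length_penalty(64))  # 0.0
print(length_penalty(70))  # -6.0
print(length_penalty(50))  # -14.0
```

Note that this form penalizes summaries that are too short as well as too long, so it pulls outputs toward exactly `MAX_LENGTH` tokens rather than just capping them.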

quality_rewards:

ROUGE-L only cares about the longest common subsequence — it misses synonyms and paraphrases entirely.

METEOR handles both: it aligns tokens with synonym matching via WordNet and balances precision + recall with a chunk-order penalty.

BLEU, on the other hand, focuses on n-gram precision and applies a brevity penalty to outputs shorter than the reference.
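To make the ROUGE-L point concrete, here is a dependency-free sketch of an LCS-based ROUGE-L F1 score. Lowercased whitespace tokenization and the plain F1 (equal weight on precision and recall) are simplifying assumptions. The rewards in the project presumably came from a metrics library, which may tokenize and weight differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from LCS precision and recall over whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# A paraphrase with the same meaning shares only "on the" with the reference.
print(round(rouge_l_f1("the cat sat on the mat", "a feline rested on the rug"), 3))  # 0.333
```

The paraphrase scores low even though it means the same thing, which is the gap that METEOR's synonym matching is meant to cover.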

submitted by /u/East-Muffin-6472