Training Qwen2.5-0.5B-Instruct on Reddit posts summarization tasks with length constraint on my 3xMac Minis with GRPO - evals update

Reddit r/LocalLLaMA / 4/16/2026


Key Points

  • The author fine-tuned two Qwen2.5-0.5B-Instruct variants for Reddit post summarization with an explicit length constraint, comparing length-penalty-only versus quality+length rewards.
  • They evaluated summarization quality using LLM-as-a-Judge with DeepEval metrics covering Faithfulness, Coverage, Conciseness, and Clarity, then computed a composite mean score.
  • Adding a quality reward (ROUGE-L) on top of length penalty improved performance slightly (2.5/4) versus length penalty alone (2.4/5).
  • The quality+length-penalty model showed a statistically significant difference on the final composite score (one-sided t-test, p=0.0042) across five evaluation rounds on a 200-sample subset of the smoltldr dataset.
  • The post also explains the rationale for LLM-as-a-Judge as a cheaper alternative to human labels when rewards are subjective or hard to define precisely.

So, I trained two variants of this task:

  • using just length penalty
  • using a quality reward and length penalty

I ran an LLM-as-a-Judge eval to check summarization quality using DeepEval tools. The metrics are:

  • Conciseness
  • Coverage
  • Clarity
  • Faithfulness

The results are as follows:

  • with quality + length penalty rewards: 2.5/4
  • with just length penalty: 2.4/5

Results:

The model trained with both the length penalty and the ROUGE-L quality reward shows a statistically significant improvement on the final composite score (one-sided t-test, p = 0.0042), based on 5 rounds of evals for each model.

Evaluation was performed on a 200-sample test subset of the smoltldr dataset.

Baseline: length penalty only
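To make the significance test above concrete, here is a minimal sketch of a one-sided two-sample t-test over per-round composite scores. The score lists are illustrative placeholders, not the author's actual eval numbers, and the equal-variance Student's t-test is an assumption (in practice `scipy.stats.ttest_ind(a, b, alternative="greater")` does the same job and also returns the p-value):

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical per-round composite scores (5 eval rounds per model)
quality_plus_length = [2.52, 2.48, 2.55, 2.47, 2.50]
length_only = [2.38, 2.41, 2.36, 2.43, 2.40]

def one_sided_t(a, b):
    """Student's t statistic for H1: mean(a) > mean(b), assuming equal variances."""
    na, nb = len(a), len(b)
    # Pooled sample variance across both groups
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

t = one_sided_t(quality_plus_length, length_only)
# One-tailed critical value for alpha = 0.05 at df = 8 is about 1.860
print(f"t = {t:.2f}, significant at 5%: {t > 1.860}")
```

With 5 rounds per model, df = 5 + 5 - 2 = 8, which is where the critical value comes from.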

  • What is LLM-as-A-Judge?

Well, it lets any LLM of your choice judge outputs that can't easily be mapped to a definitive reward because of their variance or subjective nature, like summarization!

Such rewards vary from person to person, so we employ an LLM to act like a judge, score the outputs multiple times, and aggregate the results, which is cheap compared to human labelers!

So, I used DeepEval's amazing tools to build an eval system that scores my models' summarizations on the aforementioned four factors:

Faithfulness: does the summary stay fully grounded in the source, with no hallucinations or contradictions?

Coverage: does the summary capture the source’s key points without missing meaning-critical information?

Conciseness: is the summary substantially shorter than the source without redundancy or unnecessary detail?

Clarity: is the summary easy to read, grammatically clean, and understandable on its own?

The composite score is the mean of the above scores.
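As a tiny sketch, the composite is just the arithmetic mean of the four judge scores; the example values below are illustrative, not taken from the post:

```python
from statistics import mean

# Hypothetical per-metric judge scores for one summary
scores = {
    "faithfulness": 3.0,
    "coverage": 2.0,
    "conciseness": 2.5,
    "clarity": 2.5,
}

# Composite = unweighted mean of the four metric scores
composite = mean(scores.values())
print(composite)  # 2.5
```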

  • Reward system

length_penalty: basically, -abs(response_length - MAX_LENGTH)

quality_reward: ROUGE-L, which is basically the longest common subsequence (LCS) overlap with the golden summaries included in the above dataset, to ensure the generated responses keep some structure and to minimize degradation.
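The two reward terms can be sketched as follows. This is a minimal illustration, assuming token-level lengths and the standard LCS-based ROUGE-L F1; the MAX_LENGTH value and the tokenization are assumptions, not the author's exact training code:

```python
MAX_LENGTH = 50  # assumed target summary length in tokens

def length_penalty(response_tokens: list[str]) -> float:
    # Penalize any deviation from the target length, in either direction.
    return -abs(len(response_tokens) - MAX_LENGTH)

def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: list[str], reference: list[str]) -> float:
    # ROUGE-L F1: harmonic mean of LCS precision (vs. candidate length)
    # and LCS recall (vs. reference length).
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)
```

For example, `rouge_l_f1("the cat sat".split(), "the cat sat down".split())` gives 6/7 ≈ 0.857, while a response of exactly MAX_LENGTH tokens incurs zero length penalty.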

submitted by /u/East-Muffin-6472