So, a few days back I shared a post where I trained a tiny Qwen2.5-0.5B-Instruct model on smoltldr (a Reddit post summarization dataset of 2k rows) to output summaries with a max length of about 64, using RLVR with GRPO. However, there was a catch!
Hence the charts showed a sharp decline and convergence towards a response length of around 15 tokens. I used two rewards:
Trained for one full epoch with a max batch size of 2 (before hitting an OOM), the results matched the previous run, with one crucial difference -
Anyways, next up:
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO [P]
Reddit r/MachineLearning / 4/13/2026
💬 Opinion | Ideas & Deep Analysis | Models & Research
Key Points
- A Reddit user retrained a small Qwen2.5-0.5B-Instruct bf16 model for Reddit post summarization using GRPO (RLVR), targeting a summary length they intended to be 64 tokens but mistakenly set as 64 characters.
- They observed W&B metrics where average response length collapsed and saturated around 10–15 tokens, attributing the issue to the character/token confusion.
- The training used two rewards: (1) a length penalty based on deviation from MAX_LENGTH and (2) a quality reward using ROUGE-L against golden summaries, added to reduce reward gaming.
- Including the ROUGE-L quality reward prevented degenerate behaviors seen in earlier runs with only the length penalty (e.g., generating filler “*20 tokens” content).
- They report similar results across runs with/without the quality reward after one epoch, and plan next steps to debug GRPO’s reward “gaming,” test alternative metrics, and try judge-based evaluation (LLM-as-a-judge).
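The two rewards and the character/token mix-up described above can be illustrated with a minimal sketch. This is not the poster's code: the exact penalty formula, the function names (`length_reward`, `rouge_l_reward`, `word_count`), and the whitespace tokenization are all assumptions made for illustration. ROUGE-L is computed here as the F1 score over the longest common subsequence, and a whitespace word count stands in for a real tokenizer.

```python
from typing import Callable, List

MAX_LENGTH = 64  # target length; its unit depends on the `measure` function used


def length_reward(completion: str, measure: Callable[[str], int] = len) -> float:
    """Hypothetical length penalty: 0 at MAX_LENGTH, more negative as length deviates.

    With the default `measure=len` the length is counted in characters,
    which reproduces the character/token confusion described above.
    """
    return -abs(measure(completion) - MAX_LENGTH) / MAX_LENGTH


def word_count(text: str) -> int:
    """Stand-in for a tokenizer-based count (assumption: whitespace words)."""
    return len(text.split())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_reward(completion: str, reference: str) -> float:
    """ROUGE-L F1 between a completion and a golden summary (whitespace tokens)."""
    cand, ref = completion.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    summary = "word " * 16  # 80 characters, 16 words
    print(length_reward(summary))              # measured in characters
    print(length_reward(summary, word_count))  # measured in words
    print(rouge_l_reward("the cat", "the cat sat on mat"))
```

Passing a different `measure` changes what MAX_LENGTH means without touching the penalty itself. Under the character-based default, a completion that scores best is only 64 characters long, which fits the post's explanation for why response length settled at a small number of tokens.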