Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO

Reddit r/LocalLLaMA / 4/13/2026


Key Points

  • A Reddit poster reports training a small bf16 Qwen2.5-0.5B-Instruct model with GRPO (RLVR) on smoltldr, a 2k-row Reddit post summarization dataset.
  • They intended to set MAX_LENGTH to 64 tokens but accidentally used 64 characters, so the average generation length saturated around 10-15 tokens.
  • The reward design combined a length penalty (penalizing deviation from the target length) with a quality reward (ROUGE-L against golden summaries); without the quality reward, the model produced degenerate outputs that "gamed" the reward, whereas with both rewards the collapse was suppressed.
  • Planned next steps include investigating why GRPO didn't attempt other reward gaming, exploring metrics beyond ROUGE-L, quantifying results with LLM-as-a-judge, and trying other conditions (a different MAX_LENGTH, or stating the reward spec explicitly in the prompt).

So, a few days back I shared a post where I trained a tiny Qwen2.5-0.5B-Instruct model on smoltldr (a Reddit post summarization dataset of 2k rows) to output summaries capped at a MAX_LENGTH of about 64, using RLVR with GRPO.

However, there was a catch!

  • The wandb chart for average response length kept going down and saturated around 10-15 tokens. This was the result of me confusing character counts with token counts: I meant to cap at 64 tokens, but I accidentally capped at 64 characters!

Hence the charts showed a sharp decline, converging to a response length hovering around 15 tokens.
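A hedged illustration (my guess at the setup, not the author's code) of how the mix-up plays out: the same penalty formula optimizes toward very different outputs depending on whether "length" means characters or tokens.

```python
# Illustrative sketch: the length penalty described in the post, applied to
# a character count vs. a (crude, whitespace-based) token count.

def length_penalty(length: int, max_length: int = 64) -> float:
    """Reward peaks (at 0) when length hits MAX_LENGTH exactly."""
    return -abs(length - max_length)

summary = "A tiny model learns to write very short Reddit TLDR summaries."

# Accidental reading: 64 *characters* -> this ~62-char summary is near-optimal...
print(length_penalty(len(summary)))          # close to 0

# ...but it is only ~11 whitespace tokens, matching the observed 10-15 token
# saturation. The intended 64-*token* target would penalize it heavily:
print(length_penalty(len(summary.split())))  # strongly negative
```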

I used two rewards:

  • length_penalty: basically -abs(response_length - MAX_LENGTH)
  • quality_reward: ROUGE-L, which is based on the longest common subsequence (LCS) against the golden summaries included in the dataset, to ensure some structure throughout the generated responses and minimize degradation.

I trained for one full epoch with a batch size of 2 (the max before hitting an OOM). The results were identical to the previous run, with one crucial difference:

  • Without a quality reward, my previous runs tried to game the reward by outputting junk like "-" repeated 20 times and nothing else.
  • Not this time: I got nearly the same rewards in both experiments (length penalty alone vs. both rewards combined), and no degradation in the rollouts after one full epoch, so I wonder why?

Anyways, next up:

  • Find out why GRPO didn't try to game the reward system in other ways.
  • Try metrics other than ROUGE-L to maybe get better summarizations.
  • Setup LLM-As-A-Judge to quantify the results.
  • Train some HF SmolLM series now!
  • What if I described the reward system and the MAX_LENGTH in the prompt itself along with the task?
  • Try a different MAX_LENGTH?

https://preview.redd.it/bj5sxf46gyug1.png?width=800&format=png&auto=webp&s=c9355cea573c26db1c75668e861ffb828d7d105f

https://preview.redd.it/xmi75hv7gyug1.png?width=800&format=png&auto=webp&s=3235504cd948f9cb12c23a72fb98a08fdd31ca0a

https://preview.redd.it/o4bmvxy8gyug1.png?width=800&format=png&auto=webp&s=b0a6894556ac4c05cb0989488f754c0872581bad

submitted by /u/East-Muffin-6472