So, a few days back I shared a post where I trained a tiny Qwen2.5-0.5B-Instruct model on smoltldr (a Reddit post summarization dataset of 2k rows) to output summaries of about 64 max length, using RLVR with GRPO. However, there was a catch!
Hence the charts showed a sharp decline and convergence towards a response length hovering around 15 tokens. I used two rewards:
Trained for one full epoch with a max batch size of 2 (before hitting an OOM), the results were identical to the previous run, however with one crucial difference -
Anyways, next up:
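The "catch" behind the ~15-token plateau (per the Key Points below: a MAX_LENGTH of 64 *characters* was applied where 64 *tokens* was intended) can be illustrated with a quick back-of-the-envelope check. The helper below is a hypothetical sketch, not the poster's code; with roughly 4 characters per English token, a 64-character cap leaves only about 10-15 tokens:

```python
def char_capped_token_count(text: str, max_chars: int = 64) -> int:
    """How many whitespace-split tokens survive a *character*-level cap.

    Illustrates the tokens-vs-characters mixup: the cap was meant as
    a 64-token budget but acted on characters instead.
    """
    return len(text[:max_chars].split())

# A plausible summary-length sentence, truncated at 64 characters,
# retains far fewer than the intended 64 tokens.
summary = ("The author fine-tuned a small instruct model with GRPO to "
           "produce short Reddit post summaries under a length budget.")
print(char_capped_token_count(summary))  # lands in the 10-15 token range
```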
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO
Reddit r/LocalLLaMA / 4/13/2026
💬 Opinion | Ideas & Deep Analysis | Tools & Practical Usage | Models & Research
Key Points
- The Reddit poster reports training a bf16 Qwen2.5-0.5B-Instruct small model with GRPO (RLVR) on smoltldr (a 2k-row Reddit post summarization dataset).
- MAX_LENGTH was set with "64 tokens" in mind but unintentionally acted as "64 characters", causing the mean generation length to saturate around 10-15 tokens.
- The reward design combined a length penalty (punishing deviation from the target length) with a quality reward (ROUGE-L of the summary); without the quality reward, the model produced degenerate outputs that "farmed" the reward, while combining both kept the collapse in check.
- As next steps, the poster plans to investigate why GRPO does not attempt other reward-gaming strategies, consider evaluation metrics besides ROUGE-L, quantify quality with LLM-as-a-judge, and try other conditions (changing MAX_LENGTH, or stating the reward specification explicitly in the prompt).
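The two-part reward described above can be sketched as a pair of GRPO-style reward functions: a length penalty plus a ROUGE-L quality term. This is a minimal illustration under assumptions; the function names, the 64-token target, and the simple LCS-based ROUGE-L F1 below are stand-ins, not the poster's actual implementation:

```python
TARGET_TOKENS = 64  # assumed summary budget (the intended "64 tokens")

def length_reward(completion_tokens: list[str]) -> float:
    """Penalize deviation from the target length, scaled to [0, 1]."""
    deviation = abs(len(completion_tokens) - TARGET_TOKENS)
    return max(0.0, 1.0 - deviation / TARGET_TOKENS)

def rouge_l_reward(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-L F1 via longest-common-subsequence over tokens."""
    m, n = len(candidate), len(reference)
    if m == 0 or n == 0:
        return 0.0
    # Standard O(m*n) LCS dynamic program.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    precision, recall = lcs / m, lcs / n
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def total_reward(candidate: list[str], reference: list[str]) -> float:
    # Length reward alone is gameable with degenerate on-target-length
    # output; adding the quality term is what kept the collapse in check.
    return length_reward(candidate) + rouge_l_reward(candidate, reference)
```

In a GRPO setup each sampled completion in a group would be scored by `total_reward`, and advantages computed from the group's reward statistics; frameworks like TRL accept such plain Python callables as custom reward functions.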
Related Articles

Black Hat USA
AI Business

Black Hat Asia
AI Business

Agentic coding at enterprise scale demands spec-driven development
VentureBeat

How to build effective reward functions with AWS Lambda for Amazon Nova model customization
Amazon AWS AI Blog

How 25 Students Went from Idea to Deployed App in 2 Hours with Google Antigravity
Dev.to