Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]

Reddit r/MachineLearning / 4/15/2026


Key Points

A Reddit post reports successful fine-tuning of a Qwen2.5-0.5B-Instruct bf16 model for Reddit post summarization using GRPO implemented from scratch in PyTorch.
The author experiments with reward design, starting from quality_reward (ROUGE-L) plus length_penalty, then plans to use length penalty as the reward alone to test for “gaming” or degraded outputs.
Training setup uses a small ML cluster (3x Mac Minis), where one node runs GRPO training while two nodes generate rollouts via vLLM.
Two training variants are compared: length penalty only vs. length penalty plus a quality reward (the post mentions BLEU, METEOR, and ROUGE-L as options), with rollout behavior tracked (e.g., an average rollout length of about 64 tokens).
Evaluation uses LLM-as-a-judge (gpt-5) with a DeepEval-based rubric that scores faithfulness, coverage, conciseness, and clarity as separate axes.

So, yesterday's run was a success and I did get an avg rollout length of about 64 tokens, as shown in the attached image!

This was with quality_reward + length_penalty (more info below!)

Next, I'll be going with the length penalty alone as the reward, with the mistake of counting characters as tokens fixed, and see if there is any gaming-the-system stuff or degraded outputs!

I used 2 rewards:

length_penalty: basically, -abs(response_length - MAX_LENGTH), with response_length counted in tokens
quality_reward: ROUGE-L, which is basically the longest common subsequence (LCS) between the generated summary and the golden summaries I had as part of the above dataset, to ensure we have some structure throughout the generated responses
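
For reference, here is a minimal sketch of how the two rewards could look in plain Python. It is not the code from the actual run. Assumptions: MAX_LENGTH = 64 is a guess based on the reported average rollout length, ROUGE-L is computed as an LCS-based F1 over lowercased whitespace-split words (a real setup might use a library or the model tokenizer instead), and combined_reward with its two weights is a hypothetical helper, since the post does not say how the two terms are scaled against each other.

```python
from typing import Sequence

# Assumed target length in tokens; the post does not state the actual value.
MAX_LENGTH = 64


def length_penalty(token_ids: Sequence[int], max_length: int = MAX_LENGTH) -> float:
    """-abs(response_length - MAX_LENGTH), with length counted in tokens."""
    # Pass tokenizer output here, not the raw string: len(text) would count
    # characters, which is the mistake mentioned in the post.
    return -float(abs(len(token_ids) - max_length))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace-split words (a simplification)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def combined_reward(
    candidate: str,
    token_ids: Sequence[int],
    reference: str,
    quality_weight: float = 1.0,  # hypothetical weight, not from the post
    length_weight: float = 1.0,   # hypothetical weight, not from the post
) -> float:
    """quality_reward + length_penalty, with illustrative weights."""
    return (quality_weight * rouge_l_f1(candidate, reference)
            + length_weight * length_penalty(token_ids))
```

One thing the sketch makes visible: ROUGE-L lives in [0, 1] while the length penalty is measured in whole tokens, so with equal weights the length term dominates unless one of them is rescaled.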

Setup: 3x Mac Minis in a cluster running MLX.

One node drives training using GRPO, while the other two push rollouts via vLLM. I trained two variants: