PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

arXiv cs.AI / 5/5/2026


Key Points

  • PERSA is a reinforcement learning from human feedback (RLHF) pipeline for generating programming feedback in a specific professor’s grading voice while preserving diagnostic correctness.
  • The method combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and PPO, with learning deliberately constrained to style-bearing parts of the transformer.
  • By updating only the top transformer blocks and their feed-forward projections (using parameter-efficient fine-tuning), PERSA reduces global parameter drift and improves stylistic controllability.
  • Experiments on APPS, PyFiXV, and CodeReviewQA show strong professor-style transfer across Llama-3 and Gemma-2 backbones, including large gains in style alignment while correctness accuracy remains very high.
  • The work positions PERSA as a practical approach for personalized educational feedback that aligns both “what to say” (content accuracy) and “how to say it” (tone and structure).
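The selective-update constraint described above (training only the top transformer blocks and their feed-forward projections) amounts to choosing which parameters stay trainable and freezing the rest. Here is a minimal, dependency-free sketch of that selection logic; the parameter names (`layers.<i>.ffn_up` etc.) and layer count are illustrative assumptions, not PERSA's actual naming.

```python
def trainable_param_names(num_layers, top_k, all_names):
    """Return the parameter names left unfrozen: feed-forward
    projections in the top `top_k` transformer blocks.
    Everything else (embeddings, attention, lower blocks) stays frozen."""
    top_layers = set(range(num_layers - top_k, num_layers))
    keep = []
    for name in all_names:
        # Illustrative naming scheme: "layers.<idx>.<submodule>.weight"
        parts = name.split(".")
        if parts[0] != "layers":
            continue  # e.g. embeddings or lm_head: always frozen
        layer_idx = int(parts[1])
        if layer_idx in top_layers and parts[2] in ("ffn_up", "ffn_down"):
            keep.append(name)
    return keep

# Toy 8-layer model with attention and feed-forward weights per block.
names = [f"layers.{i}.{m}.weight"
         for i in range(8)
         for m in ("attn_q", "attn_k", "ffn_up", "ffn_down")]
print(trainable_param_names(8, 2, names))
```

In a real PyTorch setup, the same predicate would drive `param.requires_grad = False` for every name not in the returned set, so the optimizer only sees the style-bearing parameters.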

Abstract

Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM's style with a specific instructor's tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation to align with a target instructor's style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning: it updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness; for example, on APPS it boosts the Style Alignment Score (SAC) to 96.2% (from 34.8% for the Base model), with Correctness Accuracy (CA) up to 100% on both Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone and structure).
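The "reward modeling from pairwise preferences" step in the abstract is standardly trained with a Bradley-Terry objective: given a reward score for the preferred (e.g. professor-approved) feedback and one for the dispreferred alternative, the loss pushes the gap between them apart. A minimal sketch of that objective, assuming scalar reward scores already computed by a reward model (the function name and values here are illustrative, not from the paper):

```python
import math

def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry style preference loss used in RLHF reward modeling:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the reward model already ranks the
    preferred feedback above the rejected one, and large otherwise."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly ranked pair -> small loss; inverted ranking -> large loss.
print(round(pairwise_preference_loss(2.0, 0.0), 4))
print(round(pairwise_preference_loss(0.0, 2.0), 4))
```

Minimizing this loss over a dataset of (chosen, rejected) feedback pairs yields the scalar reward signal that PPO then optimizes the policy against.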