HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

arXiv cs.AI / 4/23/2026


Key Points

  • The paper introduces HiPO (Hierarchical Preference Optimization) as an extension of DPO to better align LLMs on complex multi-step reasoning tasks.
  • HiPO improves training granularity by splitting responses into hierarchical segments (query clarification/context, reasoning steps, and final answer) and applying a separate weighted DPO-style loss to each segment.
  • Unlike prior approaches that separately focus on stable preference learning (e.g., DPO variants) or structured reasoning (e.g., multi-agent RL or Tree of Thoughts), HiPO aims to combine both strengths.
  • Experiments fine-tuning multiple 7B LLMs with HiPO versus DPO on the Math Stack Exchange preference dataset show that the HiPO-trained models consistently outperform their DPO counterparts on common math benchmarks.
  • Quality proxies judged by GPT-4.1 indicate that HiPO yields responses with improved organization, logical flow, and consistency.

Abstract

Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.
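The segment-wise loss described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-segment sequence log-probabilities (for chosen and rejected responses, under the policy and reference models) have already been computed, and the weight values and function names are hypothetical. The segment names follow the paper's decomposition (query clarification/context, reasoning steps, answer).

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log sigmoid(beta * margin), written as log(1 + exp(-x)) for clarity
    return math.log1p(math.exp(-beta * margin))

def hipo_loss(segment_logps, weights, beta=0.1):
    """HiPO-style loss: weighted sum of per-segment DPO losses.

    segment_logps maps a segment name to a tuple
    (pi_chosen, pi_rejected, ref_chosen, ref_rejected) of log-probs
    restricted to that segment's tokens (assumed precomputed).
    """
    return sum(weights[seg] * dpo_loss(*lps, beta=beta)
               for seg, lps in segment_logps.items())

# Hypothetical usage with the paper's three segments; weights are illustrative.
logps = {
    "context":   (-12.0, -13.5, -12.4, -13.1),
    "reasoning": (-85.2, -90.0, -86.0, -88.5),
    "answer":    (-4.1,  -6.0,  -4.5,  -5.2),
}
weights = {"context": 0.2, "reasoning": 0.5, "answer": 0.3}
loss = hipo_loss(logps, weights)
```

Setting all weights on a single segment recovers plain DPO on that segment, so the weighted sum is the mechanism that lets training emphasize, say, the reasoning steps over the final answer.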