HiPER: Hierarchical Reinforcement Learning for LLM Agents

Overview

HiPER addresses long-horizon reinforcement learning for interactive language agents by making their implicit hierarchical behavior explicit and aligning the training signal with that structure. Rather than treating the agent as a flat policy that selects one action at a time, HiPER separates high-level planning from low-level execution: at each step, the agent decides whether to keep the current subgoal or switch to a new one, then acts according to that subgoal. This Plan-Execute structure helps the agent maintain temporally extended objectives while still adapting dynamically, reducing failures such as task drift, premature switching, and repetitive ineffective actions. To train this hierarchy, HiPER introduces Hierarchical Advantage Estimation, which propagates credit both within subgoal segments and across segments, providing a more suitable learning signal than flat advantage estimation and offering benefits in unbiasedness and variance reduction.

Empirically, HiPER achieves strong results on interactive benchmarks such as ALFWorld and WebShop, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct. Its key message is that effective long-horizon agent training requires both an explicit behavioral hierarchy and a credit assignment method matched to that hierarchy: planning alone is insufficient if learning remains flat, and better credit assignment depends on exposing the right decision structure.

Performance Summary

HiPER improves long-horizon agent training across ALFWorld and WebShop, with stronger final success rates and more stable training curves.

Approach

HiPER turns long-horizon interaction into an explicit Plan-Execute loop, then assigns credit separately to decisions made at the planning and execution levels.

Instead of optimizing a flat stream of actions, HiPER exposes the hierarchy already used by capable agents: a high-level policy selects or maintains a subgoal, while a low-level executor takes grounded actions until the next planning boundary.
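The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `rollout`, `plan`, `act`, `Step`, and the toy environment are all hypothetical, and in a real agent both `plan` and `act` would be calls to the language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    subgoal: str
    switched: bool  # True if the planner chose a new subgoal at this step
    action: str
    reward: float


def rollout(
    reset: Callable[[], str],
    step: Callable[[str], Tuple[str, float, bool]],
    plan: Callable[[str, Optional[str]], Tuple[bool, str]],
    act: Callable[[str, str], str],
    max_steps: int = 50,
) -> List[Step]:
    """Run one episode: at every step decide keep/switch, then act."""
    obs = reset()
    subgoal: Optional[str] = None
    trajectory: List[Step] = []
    for _ in range(max_steps):
        # High level: keep the current subgoal or switch to a proposed one.
        switch, proposed = plan(obs, subgoal)
        switched = subgoal is None or switch
        if switched:
            subgoal = proposed
        # Low level: pick a grounded action conditioned on the subgoal.
        action = act(obs, subgoal)
        obs, reward, done = step(action)
        trajectory.append(Step(subgoal, switched, action, reward))
        if done:
            break
    return trajectory


# Toy stand-ins (assumptions for illustration only).
def make_toy_env():
    state = {"t": 0}

    def reset() -> str:
        state["t"] = 0
        return "t=0"

    def step(action: str) -> Tuple[str, float, bool]:
        state["t"] += 1
        done = state["t"] >= 4
        return f"t={state['t']}", (1.0 if done else 0.0), done

    return reset, step


def toy_plan(obs: str, current: Optional[str]) -> Tuple[bool, str]:
    target = "find object" if obs in ("t=0", "t=1") else "place object"
    return target != current, target


def toy_act(obs: str, subgoal: str) -> str:
    return f"{subgoal} @ {obs}"
```

Recording the `switched` flag on every step is what later makes it possible to recover segment boundaries for credit assignment.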

Plan

Choose the next subgoal

Planning: the agent decides whether to continue the current objective or switch to a better one as observations arrive.

Execute

Execute grounded actions

Execution: the agent converts each subgoal into environment actions, keeping local behavior tied to the current plan.

Credit

Train with hierarchical advantages

Hierarchical Advantage Estimation propagates feedback within each subgoal segment and across segment boundaries.
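This page does not give the estimator's formula, so the following is only a generic two-level sketch of the idea, not the paper's Hierarchical Advantage Estimation. It assumes standard GAE is applied twice: within each segment for the executor (bootstrapping from the planner's value at the next boundary), and across segments for the planner (treating each segment as one macro-step discounted by its length). The function names, the two value arrays, and the bootstrapping choice are all assumptions.

```python
from typing import List, Sequence, Tuple


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    bootstrap: float,
    discounts: Sequence[float],
    lam: float,
) -> List[float]:
    """Standard GAE over one sequence with a per-step discount."""
    adv = [0.0] * len(rewards)
    next_value, running = bootstrap, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + discounts[t] * next_value - values[t]
        running = delta + discounts[t] * lam * running
        adv[t] = running
        next_value = values[t]
    return adv


def hierarchical_advantages(
    rewards: Sequence[float],
    low_values: Sequence[float],   # executor value estimate at each step
    high_values: Sequence[float],  # planner value estimate at each segment start
    boundaries: Sequence[int],     # sorted segment start indices, boundaries[0] == 0
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    starts = list(boundaries)
    ends = starts[1:] + [len(rewards)]
    low_adv = [0.0] * len(rewards)
    seg_returns, seg_discounts = [], []
    for k, (s, e) in enumerate(zip(starts, ends)):
        # Within a segment: bootstrap from the planner's value of the next
        # segment, or 0.0 if the episode ends here (an assumption).
        boot = high_values[k + 1] if k + 1 < len(starts) else 0.0
        low_adv[s:e] = gae(rewards[s:e], low_values[s:e], boot, [gamma] * (e - s), lam)
        seg_returns.append(sum(gamma**i * r for i, r in enumerate(rewards[s:e])))
        seg_discounts.append(gamma ** (e - s))
    # Across segments: each segment is one macro-step for the planner.
    high_adv = gae(seg_returns, high_values, 0.0, seg_discounts, lam)
    return low_adv, high_adv
```

In this sketch the executor's advantages would weight the action tokens and the planner's advantages would weight the keep-or-switch decisions, so each level receives feedback on the timescale at which it acts.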

[Figure: HiPER hierarchical Plan-Execute reinforcement learning diagram]

Example Trajectories

Appendix E of the paper illustrates how HiPER agents keep subgoals while they remain useful and switch when the task phase changes.
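Grouping a trajectory into contiguous subgoal segments is simple to reproduce. The sketch below assumes a trajectory stored as `(subgoal, action)` pairs. That layout, the function name `group_by_subgoal`, and the sample steps are hypothetical and are not taken from the paper's appendix.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def group_by_subgoal(
    steps: Sequence[Tuple[str, str]],
) -> List[Tuple[str, List[str]]]:
    """Collapse consecutive steps that share a subgoal into one segment."""
    return [
        (subgoal, [action for _, action in group])
        for subgoal, group in groupby(steps, key=lambda s: s[0])
    ]


# Hypothetical steps, for illustration only.
example = [
    ("find mug", "go to shelf"),
    ("find mug", "take mug"),
    ("heat mug", "go to microwave"),
    ("find mug", "look around"),
]
segments = group_by_subgoal(example)
```

Because `groupby` only merges adjacent items, a subgoal that reappears later in the episode forms a new segment rather than being folded into the earlier one.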

BibTeX

@article{peng2026hiper,
  title={HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents},
  author={Peng, Jiangweizhi and Liu, Yuanxin and Zhou, Ruida and Fleming, Charles and Wang, Zhaoran and Garcia, Alfredo and Hong, Mingyi},
  journal={arXiv preprint arXiv:2602.16165},
  year={2026}
}