Long-horizon LLM agent training
HiPER separates high-level planning from low-level execution and trains both levels with Hierarchical Advantage Estimation.
HiPER addresses long-horizon reinforcement learning for interactive language agents by making their implicit hierarchical behavior explicit and aligning the training signal with that structure. Rather than treating the agent as a flat policy that selects one action at a time, HiPER separates high-level planning from low-level execution: at each step, the agent decides whether to keep the current subgoal or switch to a new one, then acts according to that subgoal. This Plan-Execute structure helps the agent maintain temporally extended objectives while still adapting dynamically, reducing failures such as task drift, premature switching, and repetitive ineffective actions. To train this hierarchy, HiPER introduces Hierarchical Advantage Estimation, which propagates credit both within subgoal segments and across segments, providing a more suitable learning signal than flat advantage estimation and offering benefits in unbiasedness and variance reduction.
Empirically, HiPER achieves strong results on interactive benchmarks such as ALFWorld and WebShop, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct. Its key message is that effective long-horizon agent training requires both an explicit behavioral hierarchy and a credit assignment method matched to that hierarchy: planning alone is insufficient if learning remains flat, and better credit assignment depends on exposing the right decision structure.
HiPER improves long-horizon agent training across ALFWorld and WebShop, with stronger final success rates and more stable training curves.
HiPER turns long-horizon interaction into an explicit Plan-Execute loop, then assigns credit separately to decisions made at the planning and execution levels.
Instead of optimizing a flat stream of actions, HiPER exposes the hierarchy already used by capable agents: a high-level policy selects or maintains a subgoal, while a low-level executor takes grounded actions until the next planning boundary.
Planning: the agent decides whether to continue the current objective or switch to a better one as observations arrive.
Execution: the agent converts each subgoal into environment actions, keeping local behavior tied to the current plan.
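The Plan-Execute loop described above can be sketched in a few lines. This is only an illustrative sketch, not the paper's implementation: `run_episode`, `segments`, the `plan` and `execute` callables, and the toy key-and-door environment are all hypothetical names introduced here, and in HiPER both levels would be produced by the language model rather than by hand-written rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    subgoal: str
    switched: bool  # True when the planner changed the subgoal at this step
    action: str
    reward: float


def run_episode(
    env_reset: Callable[[], str],
    env_step: Callable[[str], Tuple[str, float, bool]],
    plan: Callable[[str, Optional[str]], str],
    execute: Callable[[str, str], str],
    max_steps: int = 50,
) -> List[Step]:
    """At every step: keep or switch the subgoal, then act under that subgoal."""
    obs = env_reset()
    subgoal: Optional[str] = None
    trajectory: List[Step] = []
    for _ in range(max_steps):
        new_subgoal = plan(obs, subgoal)   # high level: keep or switch
        switched = new_subgoal != subgoal
        subgoal = new_subgoal
        action = execute(obs, subgoal)     # low level: grounded action
        obs, reward, done = env_step(action)
        trajectory.append(Step(subgoal, switched, action, reward))
        if done:
            break
    return trajectory


def segments(trajectory: List[Step]) -> List[List[Step]]:
    """Group steps into contiguous subgoal segments, split at each switch."""
    groups: List[List[Step]] = []
    for step in trajectory:
        if step.switched or not groups:
            groups.append([])
        groups[-1].append(step)
    return groups


# Hypothetical toy task: search twice to find a key, then open a door.
def make_toy_env():
    state = {"searches": 0}

    def reset() -> str:
        state["searches"] = 0
        return "key missing"

    def step(action: str) -> Tuple[str, float, bool]:
        if action == "search":
            state["searches"] += 1
        has_key = state["searches"] >= 2
        if action == "open" and has_key:
            return "door open", 1.0, True
        return ("key found" if has_key else "key missing"), 0.0, False

    return reset, step


def toy_plan(obs: str, current: Optional[str]) -> str:
    # Switch subgoal only when the observation shows the task phase changed.
    return "open the door" if obs == "key found" else "find the key"


def toy_execute(obs: str, subgoal: str) -> str:
    return "search" if subgoal == "find the key" else "open"


reset, step = make_toy_env()
traj = run_episode(reset, step, toy_plan, toy_execute)
print([[s.action for s in seg] for seg in segments(traj)])
```

Recording the `switched` flag at each step is what makes the segment boundaries recoverable afterwards, which is the structure a hierarchical credit assignment method needs.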
Hierarchical Advantage Estimation propagates feedback within each subgoal segment and across segment boundaries.
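To make "within each segment and across segment boundaries" concrete, here is a rough two-level sketch built on standard generalized advantage estimation (GAE). It is not the estimator defined in the paper; it only illustrates one plausible way credit could flow at both levels. The function names, the separate planner and executor value estimates, and the choice to bootstrap each segment from the next segment's planner value are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    bootstrap: float,
    discounts: Sequence[float],
    lam: float,
) -> List[float]:
    """Standard GAE over one sequence, with a per-step discount.

    values[t] is the value estimate before step t; bootstrap is the value
    estimate after the last step.
    """
    adv = [0.0] * len(rewards)
    running, next_value = 0.0, bootstrap
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + discounts[t] * next_value - values[t]
        running = delta + discounts[t] * lam * running
        adv[t] = running
        next_value = values[t]
    return adv


def hierarchical_advantages(
    seg_rewards: Sequence[Sequence[float]],  # seg_rewards[k][t]: reward at step t of segment k
    low_values: Sequence[Sequence[float]],   # executor value estimate at each step (assumed)
    high_values: Sequence[float],            # planner value estimate at each segment start (assumed)
    gamma: float = 1.0,
    lam: float = 1.0,
) -> Tuple[List[float], List[List[float]]]:
    n = len(seg_rewards)

    # Across segments: treat each segment as one macro step whose reward is
    # its discounted return and whose discount depends on its length.
    macro_rewards = [
        sum(gamma ** t * r for t, r in enumerate(rs)) for rs in seg_rewards
    ]
    macro_discounts = [gamma ** len(rs) for rs in seg_rewards]
    high_adv = gae(macro_rewards, high_values, 0.0, macro_discounts, lam)

    # Within segments: ordinary GAE per segment, bootstrapped from the planner's
    # value at the next boundary (0.0 after the final segment, assuming the
    # episode terminates there).
    low_adv = []
    for k in range(n):
        bootstrap = high_values[k + 1] if k + 1 < n else 0.0
        step_discounts = [gamma] * len(seg_rewards[k])
        low_adv.append(
            gae(seg_rewards[k], low_values[k], bootstrap, step_discounts, lam)
        )
    return high_adv, low_adv


# Two segments: two unrewarded steps, then one step that earns the task reward.
high_adv, low_adv = hierarchical_advantages(
    seg_rewards=[[0.0, 0.0], [1.0]],
    low_values=[[0.0, 0.0], [0.0]],
    high_values=[0.0, 0.5],
)
print(high_adv, low_adv)
```

In this sketch the steps of the first segment receive credit through the bootstrap value at the segment boundary even though they earned no reward themselves, while the planner's decisions are scored on segment-level returns.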
Appendix E of the paper illustrates how HiPER agents keep subgoals while they remain useful and switch when the task phase changes.
@article{peng2026hiper,
title={HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents},
author={Peng, Jiangweizhi and Liu, Yuanxin and Zhou, Ruida and Fleming, Charles and Wang, Zhaoran and Garcia, Alfredo and Hong, Mingyi},
journal={arXiv preprint arXiv:2602.16165},
year={2026}
}