
When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
This paper presents a statistical framework for offline reinforcement learning with trajectory-level supervision, where only final outcomes or preferences are observed rather than step-by-step rewards. The authors introduce OPAC, a pessimistic actor-critic algorithm that learns from these aggregated signals by estimating latent rewards and applying pessimism to account for distribution shift. Their analysis shows that moving from process-level to outcome-level feedback incurs a quantifiable statistical cost: an additional horizon factor in sample complexity. The paper also studies generalized RL objectives and proves that non-linear outcomes, such as "all-success" criteria, can make learning exponentially hard. To address this, the authors identify two structural coefficients, $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, which determine when efficient learning remains possible. Overall, the paper characterizes when sparse, trajectory-based data can successfully guide sequential decision-making.
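To make the distinction in the summary above concrete, here is a minimal Python sketch of what a learner observes under each kind of supervision. It is an illustration only, not the paper's formulation: the function names are hypothetical, and it assumes that the additive outcome is the sum of per-step rewards and that "all-success" means every step succeeded.

```python
from typing import List, Sequence


def process_feedback(step_rewards: Sequence[float]) -> List[float]:
    """Process-level supervision: every per-step reward is observed."""
    return list(step_rewards)


def sum_outcome(step_rewards: Sequence[float]) -> float:
    """Outcome-level supervision (assumed additive): only the total is observed."""
    return float(sum(step_rewards))


def all_success_outcome(step_successes: Sequence[int]) -> int:
    """Assumed 'all-success' outcome: 1 only if every step succeeded, else 0."""
    return int(all(s == 1 for s in step_successes))


# Two hypothetical 4-step trajectories with binary per-step success indicators.
traj_a = [1, 1, 0, 1]
traj_b = [0, 1, 1, 1]

# Process-level feedback tells the two trajectories apart...
distinguishable = process_feedback(traj_a) != process_feedback(traj_b)

# ...but the additive outcome is identical, so the failing step is hidden.
same_sum = sum_outcome(traj_a) == sum_outcome(traj_b)

# The all-success outcome is coarser still: both collapse to a single 0.
same_all_success = all_success_outcome(traj_a) == all_success_outcome(traj_b)
```

In this toy case the outcome-level signal cannot say which step failed, which gives some intuition for why recovering per-step (latent) rewards from aggregated feedback would need more data than process-level supervision.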
Information
- Frequency: Updated Daily
- Published: June 27, 2026 at 5:11 AM UTC
- Length: 19 min
- Rating: Clean