TRPO
TRPO, short for Trust Region Policy Optimization, is a reinforcement learning algorithm designed to stabilize policy updates. It optimizes a stochastic policy by enforcing that the new policy does not deviate too much from the previous one, measured by the Kullback–Leibler (KL) divergence. The core idea is to maximize the expected return while staying within a defined trust region around the old policy.
The method formulates a constrained optimization problem: maximize a surrogate objective that estimates the improvement in expected return, subject to the constraint that the average KL divergence between the new and old policies does not exceed a small threshold. In practice, the surrogate is approximated linearly and the KL constraint quadratically, so the search direction can be computed efficiently with the conjugate gradient method, yielding a natural-gradient step.
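The constrained problem described above can be written as follows, where $\pi_\theta$ is the new policy, $\pi_{\theta_{\text{old}}}$ the old one, $A^{\pi_{\theta_{\text{old}}}}$ the advantage function, and $\delta$ the trust-region radius:

```latex
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[
  \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}
  \, A^{\pi_{\theta_{\text{old}}}}(s, a)
\right]
\quad \text{subject to} \quad
\mathbb{E}_{s}
\left[
  D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)
  \right)
\right] \le \delta .
```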
A backtracking line search then ensures that the updated policy achieves an actual improvement in the surrogate objective while satisfying the KL constraint: the proposed step is repeatedly shrunk until both conditions hold, and the update is rejected if no acceptable step fraction is found.
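The backtracking procedure can be sketched as follows. This is a minimal illustration, not TRPO's full implementation: `surrogate` and `kl` below are toy quadratic stand-ins for the real surrogate objective and KL divergence, and all names are hypothetical.

```python
def backtracking_line_search(theta_old, full_step, surrogate, kl,
                             max_kl=0.01, backtrack_coeff=0.5,
                             max_backtracks=10):
    """Shrink the step until the surrogate improves and the KL constraint holds."""
    old_value = surrogate(theta_old)
    for i in range(max_backtracks):
        step_frac = backtrack_coeff ** i
        theta_new = [t + step_frac * s for t, s in zip(theta_old, full_step)]
        improved = surrogate(theta_new) > old_value
        within_trust_region = kl(theta_old, theta_new) <= max_kl
        if improved and within_trust_region:
            return theta_new, step_frac
    return theta_old, 0.0  # reject the update if no fraction works

# Toy example: maximize -(x - 1)^2; squared distance stands in for KL.
surrogate = lambda th: -(th[0] - 1.0) ** 2
kl = lambda a, b: (a[0] - b[0]) ** 2
theta, frac = backtracking_line_search([0.0], [1.0], surrogate, kl)
# The full step (frac = 1.0) improves the surrogate but violates the
# KL limit, so the search backtracks to a smaller accepted fraction.
```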
TRPO is an on-policy, model-free algorithm commonly applied with neural network policies, such as Gaussian policies for continuous action spaces or categorical policies for discrete ones.
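A diagonal Gaussian policy of the kind mentioned above can be sketched as below. This is an illustrative toy, not a production implementation: the mean is a linear function of the observation and the log standard deviation is state-independent, a common parameterization in continuous control; all class and variable names are assumptions.

```python
import numpy as np

class GaussianPolicy:
    """Toy diagonal Gaussian policy with a linear mean and fixed log-std."""

    def __init__(self, obs_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(act_dim, obs_dim))  # mean weights
        self.log_std = np.zeros(act_dim)  # state-independent, would be learned

    def sample(self, obs, rng):
        mean = self.W @ obs
        std = np.exp(self.log_std)
        action = mean + std * rng.normal(size=mean.shape)
        # Log-density of the diagonal Gaussian; TRPO needs such log-probs
        # to form the ratio pi_new(a|s) / pi_old(a|s) in its surrogate.
        log_prob = -0.5 * np.sum(((action - mean) / std) ** 2
                                 + 2 * self.log_std + np.log(2 * np.pi))
        return action, log_prob

policy = GaussianPolicy(obs_dim=3, act_dim=2)
rng = np.random.default_rng(1)
action, logp = policy.sample(np.ones(3), rng)  # one 2-D action and its log-prob
```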