1. Proximal policy optimization
Welcome back! In this video we introduce Proximal Policy Optimization or PPO.
2. From A2C to PPO
A2C combines the strengths of policy gradient and deep Q-learning.
However, its policy updates are based on action probabilities and advantage estimates, which are learned simultaneously and can be volatile. This may result in large and unstable policy updates, harming performance.
Imagine controlling a Mars rover exploring the Martian landscape. Each policy update can make the rover take large steps, which is risky in the rough, unpredictable Martian terrain and might cause the rover to get stuck.
PPO sets limits on how much the rover changes its direction or speed in one go. This prevents sudden, large changes that could lead to trouble. The rover adjusts its path gradually, ensuring safer navigation.
3. The probability ratio
The key innovation of PPO lies in its objective function.
At its core is the ratio r_t between the probability of the action under policy pi_theta and under policy pi_theta_old, where theta_old is the value of theta from the last policy update. In other words, for some new value of theta, r_t indicates how much more likely we are to select the action a_t with the new theta than with the old one.
In Python, we have been working with log probabilities, so we should take the ratio of their exponentials -- or equivalently the exponential of their difference.
The gradient must flow only through the numerator. The denominator is treated as a constant, so we use detach to prevent it from contributing to the gradient.
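As a minimal sketch (the tensor values and the names log_prob_new and log_prob_old are illustrative, not taken from the course code), the ratio could be computed as:

```python
import torch

# Log probabilities of the chosen action under the new and old policies
# (values and names are illustrative)
log_prob_new = torch.tensor(-1.20, requires_grad=True)
log_prob_old = torch.tensor(-1.35)

# Ratio = exp(new log prob - old log prob); detach() treats the old
# log probability as a constant, so the gradient flows only through
# the numerator (the new policy).
ratio = torch.exp(log_prob_new - log_prob_old.detach())
```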
4. Clipping the probability ratio
PPO also uses the clipped ratio.
The clip function imposes lower and upper bounds on its argument. In this example, f(x) equals x between 0.8 and 1.2; it is 0.8 below that range and 1.2 above it.
Clipping forces the ratio to stay between 1-epsilon and 1+epsilon. Epsilon is a new hyperparameter controlling how aggressive the clipping is. A typical value is 0.2.
In Python, this is done with the torch.clamp function.
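Continuing the sketch above, clipping takes a single call (the hyperparameter name epsilon is illustrative):

```python
epsilon = 0.2

# Force the ratio to stay within [1 - epsilon, 1 + epsilon]
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
```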
5. The calculate_ratios function
We can put these steps in a dedicated calculate_ratios function, taking the log probabilities and the epsilon parameter as input and returning the ratio and clipped ratio.
In this example, with epsilon = 0.2, a probability ratio of 1.25 gets clipped to 1.2.
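A possible sketch of such a function follows; the exact signature used in the exercises may differ.

```python
def calculate_ratios(log_prob_new, log_prob_old, epsilon):
    """Return the probability ratio and its clipped version."""
    # Exponential of the difference of log probabilities; the old
    # log probability is detached so it acts as a constant.
    ratio = torch.exp(log_prob_new - log_prob_old.detach())
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return ratio, clipped_ratio

# With epsilon = 0.2, a probability ratio of 0.25 / 0.20 = 1.25
# gets clipped to 1.2
ratio, clipped_ratio = calculate_ratios(
    torch.log(torch.tensor(0.25)), torch.log(torch.tensor(0.20)), epsilon=0.2
)
# ratio ≈ 1.25, clipped_ratio = 1.2
```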
6. The PPO objective function
The expected value of the ratio times the advantage (for example, the TD error) is a suitable alternative to the objective function used in REINFORCE and A2C. If left unconstrained, however, it may still suffer from large, unstable updates.
In Python, we multiply the ratio by the TD error. We detach the TD error to prevent the gradient from propagating to the critic weights.
We repeat for the clipped ratio.
Taking the minimum of both terms, we obtain PPO's clipped surrogate objective function.
The formula shown is L_CLIP(theta) = E_t[min(r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t)], but you are not required to fully understand it to complete the course.
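In code, the objective from this slide could look like the following sketch; the TD error value here is a placeholder standing in for the critic's output.

```python
# Placeholder advantage estimate; in practice this is the TD error
# computed from the critic, as in A2C
td_error = torch.tensor(0.5)

# Multiply the ratio and the clipped ratio by the detached TD error
surrogate_1 = ratio * td_error.detach()
surrogate_2 = clipped_ratio * td_error.detach()

# PPO's clipped surrogate objective: the minimum of the two terms
objective = torch.min(surrogate_1, surrogate_2)
```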
Remember that clipping the ratio makes PPO's policy updates more stable than A2C's.
7. PPO loss calculation
Let's calculate the loss. TD error calculation is identical to A2C.
Applying the steps we detailed previously, we calculate the ratio, the clipped ratio, and ultimately the objective.
We want to maximize this objective, but PyTorch expects a loss to minimize, so we take the negative.
The critic loss is unchanged from A2C.
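Putting it all together, a minimal sketch of the loss calculation might look as follows, reusing the calculate_ratios sketch from earlier; the stand-in networks, transition variables, and the squared-TD-error critic loss are assumptions about the surrounding training loop, not the course's exact code.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the critic network and one transition
critic = nn.Linear(4, 1)
state, next_state = torch.rand(4), torch.rand(4)
reward, done, gamma, epsilon = 1.0, 0.0, 0.99, 0.2
log_prob_new = torch.tensor(-1.20, requires_grad=True)
log_prob_old = torch.tensor(-1.35)

# TD error: identical to A2C (the target is treated as a constant)
td_target = reward + gamma * critic(next_state).detach() * (1 - done)
td_error = td_target - critic(state)

# Ratio, clipped ratio, and the clipped surrogate objective
ratio, clipped_ratio = calculate_ratios(log_prob_new, log_prob_old, epsilon)
objective = torch.min(ratio * td_error.detach(),
                      clipped_ratio * td_error.detach())

# Negate the objective: PyTorch minimizes, but we want to maximize it
actor_loss = -objective

# Critic loss: squared TD error, unchanged from A2C
critic_loss = td_error.pow(2).mean()
```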
8. Let's practice!
Let's practice!