1. Entropy bonus and PPO
Let's dive deeper into PPO with the entropy bonus.
2. Entropy bonus
Policy gradient algorithms such as A2C may collapse too early into deterministic behaviors, assigning a probability of 0 to some actions. It is as if your Mars rover, having made most of its recent progress by going forward, gradually reduced the probability of going backwards to zero, wrongly assuming that backing up is never a good action; yet backtracking can be a vital course of action when faced with an obstacle.
Adding an entropy bonus to the objective function helps avoid this and encourages exploration.
The entropy of a probability distribution is a concept from information theory: it measures how uncertain the distribution's outcome is.
3. Entropy of a probability distribution
For simplicity, we examine the case of a discrete variable. The entropy is defined as minus the sum, over all possible values x, of the probability of x times the log probability of x. If we use the base-2 logarithm, the resulting quantity is measured in bits; if we use the natural logarithm, it is measured in another unit called 'nats'. One nat is 1 divided by the natural log of 2, or approximately 1.44 bits.
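Written out as a formula, this is simply a restatement of the definition above, together with the nat-to-bit conversion:

```latex
H(X) = -\sum_{x} p(x)\,\log p(x),
\qquad
1\ \text{nat} = \frac{1}{\ln 2}\ \text{bits} \approx 1.44\ \text{bits}
```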
As an example, a policy with probability spread uniformly across 4 actions has an entropy of 2 bits. If it is spread uniformly between only 2 actions, like a coin flip, the entropy is 1 bit. Finally, if the policy is fully deterministic, its entropy is zero.
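These numbers are easy to verify with PyTorch; here is a minimal check (the probability tensors below are illustrative, not part of the course code):

```python
import math
import torch
from torch.distributions import Categorical

# Three example policies: uniform over 4 actions, a coin flip over 2 actions,
# and a degenerate one-action distribution (fully deterministic).
uniform_4 = Categorical(probs=torch.tensor([0.25, 0.25, 0.25, 0.25]))
coin_flip = Categorical(probs=torch.tensor([0.5, 0.5]))
deterministic = Categorical(probs=torch.tensor([1.0]))

for name, dist in [("uniform over 4", uniform_4),
                   ("coin flip", coin_flip),
                   ("deterministic", deterministic)]:
    nats = dist.entropy().item()   # PyTorch returns entropy in nats
    bits = nats / math.log(2)      # convert nats to bits
    print(f"{name}: {bits:.2f} bits")
# uniform over 4: 2.00 bits, coin flip: 1.00 bits, deterministic: 0.00 bits
```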
4. Implementing the entropy bonus
In PyTorch, the Categorical distribution that we have been using in select_action has a method to calculate the entropy. We can obtain the entropy with action_dist.entropy(), and add it to the tuple returned by the select_action function.
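As a rough sketch (the actor network and the exact return signature here are assumptions, not necessarily the course's exact code), select_action might look like this:

```python
import torch
from torch.distributions import Categorical

def select_action(actor_network, state):
    # Hypothetical actor network mapping a state to action probabilities
    state = torch.as_tensor(state, dtype=torch.float32)
    action_probs = actor_network(state)
    action_dist = Categorical(probs=action_probs)
    action = action_dist.sample()
    log_prob = action_dist.log_prob(action)
    entropy = action_dist.entropy()  # policy entropy at this state, in nats
    # Return the entropy alongside the action and its log probability
    return action.item(), log_prob, entropy
```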
In the training loop, after the loss calculations, we subtract from the actor loss the entropy multiplied by a factor c_entropy.
c_entropy is a new hyperparameter, for which a typical value might be around 0.01. A higher value for c_entropy will encourage high-entropy policies, promoting more exploration and discouraging deterministic behavior.
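A minimal sketch of that adjustment, assuming actor_loss and entropy have already been computed earlier in the loop (the surrounding code is illustrative):

```python
c_entropy = 0.01  # entropy bonus coefficient (typical starting value)

# Subtracting the weighted entropy lowers the actor loss for high-entropy
# policies, nudging gradient descent away from prematurely deterministic behavior.
actor_loss = actor_loss - c_entropy * entropy
```

From there, the backward pass and optimizer steps proceed exactly as before.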
Note that Categorical.entropy() returns entropy in nats. To obtain entropy in bits, you can divide by math.log(2). In practice this only matters when we want to interpret and compare entropy numbers, since the conversion is just a constant rescaling that could equally be absorbed into c_entropy.
5. PPO training loop
At this point, for simplicity we keep the training loop identical to what we used for A2C.
6. Towards PPO with batch updates
Note that this training loop structure does not take full advantage of the clipped surrogate objective function. We are updating the policy at every step, meaning that theta_old in the probability ratio actually coincides with theta, so the ratio is always equal to 1.
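To see why this matters, here is a tiny illustration (the values are made up) of the ratio right after a parameter update, while theta_old and theta still coincide:

```python
import torch

log_prob = torch.tensor(-1.2)        # log pi_theta(a|s) under the current policy
old_log_prob = log_prob.clone()      # log pi_theta_old(a|s): identical right after the update
ratio = torch.exp(log_prob - old_log_prob)
print(ratio)  # tensor(1.) -> clipping to [1 - epsilon, 1 + epsilon] never kicks in
```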
To make PPO really shine, we need more complex training loops, where parameter updates (every step, or every minibatch) are decoupled from less frequent policy updates (every episode, or after an arbitrary sequence of experiences termed a "rollout").
In the next video, we will touch on such alternative training loop architectures, and explain why they lead to further stability and efficiency gains.
7. Let's practice!
Let's practice!