1. Prioritized experience replay
We will now explore Prioritized Experience Replay, a refinement of the DQN algorithm that enhances Experience Replay.
2. Not all experiences are created equal
Experience replay revolutionized how DQN agents learn; however, not all experiences are created equal. Imagine studying for a final exam with limited time: spreading your effort uniformly across all topics wastes time on material you already know, while focusing on your weaker areas will likely bring you more success.
This is the principle behind Prioritized Experience Replay, or PER, which prioritizes valuable experiences. PER assigns a priority to experiences based on their "surprise" level, measured by the temporal difference error or TD error. Experiences with high TD errors indicate more learning potential for the agent and are prioritized during sampling.
3. Prioritized Experience Replay (PER)
PER implementation requires four minor adjustments:
First, each transition is assigned a priority, with new transitions given the highest priority.
Second, transitions are sampled according to a probability distribution derived from the priorities. The hyperparameter alpha determines how strongly the probabilities respond to the priorities; at the extreme, setting alpha to zero recovers uniform sampling.
Third, sampled transitions are assigned a new priority equal to their absolute TD error (plus a small epsilon, for example 0.0001, so that no priority drops to zero).
Finally, to correct the bias that non-uniform sampling introduces into the loss calculation, transitions with high sampling probability are weighted down using importance weights. The strength of this correction is controlled by the parameter beta, which starts small and is annealed towards 1 over training; the formulas are summarized after this list.
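As a reference for the slides that follow, these are the standard PER formulas from the original Prioritized Experience Replay paper, where p_i is the priority of transition i, delta_i its TD error, N the number of stored transitions, and alpha and beta the hyperparameters just described:

```latex
p_i = |\delta_i| + \epsilon, \qquad
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad
w_i = \bigl(N \cdot P(i)\bigr)^{-\beta}
```

In practice, the weights w_i are divided by their maximum so that the correction only ever scales the loss down.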
4. Implementing PER
Let's review the modified ReplayBuffer class.
The constructor requires a few extra hyperparameters, but the core structure remains a deque with a limited capacity for the replay buffer. We add another deque of the same capacity to contain the priorities.
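A minimal sketch of such a constructor is shown below; the attribute and hyperparameter names (memory, priorities, alpha, beta, epsilon) and their defaults are illustrative assumptions, not necessarily the course's exact code.

```python
from collections import deque
import numpy as np
import torch


class ReplayBuffer:
    def __init__(self, capacity, alpha=0.6, beta=0.4, epsilon=1e-4):
        # Core structure is unchanged: a bounded deque holding transitions.
        self.memory = deque(maxlen=capacity)
        # A second deque of the same capacity holds one priority per transition.
        self.priorities = deque(maxlen=capacity)
        # PER hyperparameters (names and default values are assumptions)
        self.alpha = alpha      # how strongly priorities shape sampling probabilities
        self.beta = beta        # strength of the importance-weight bias correction
        self.epsilon = epsilon  # small constant keeping every priority non-zero
```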
5. Implementing PER
The push method now also appends to the priorities deque, assigning each new transition the current maximum priority. This ensures new transitions have a high chance of being sampled so their TD error can be evaluated.
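Continuing the sketch of the ReplayBuffer class above, a push method along those lines might look like this (the fallback priority of 1.0 for an empty buffer is an assumption):

```python
    def push(self, transition):
        # New transitions get the current maximum priority so they are
        # likely to be sampled at least once and have their TD error evaluated.
        max_priority = max(self.priorities, default=1.0)
        self.memory.append(transition)
        self.priorities.append(max_priority)
```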
6. Implementing PER
The sample method now maps the priorities to sampling probabilities and selects indices based on those probabilities with numpy's np.random.choice. Additionally, it uses probabilities and hyperparameter beta to compute the importance weights per the previous formula. We convert the resulting lists to torch tensors and return the sampled transitions, indices, and weights.
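A hedged sketch of such a sample method, assuming numpy and torch were imported alongside the class as above:

```python
    def sample(self, batch_size):
        # Map priorities to sampling probabilities: p_i^alpha / sum_k p_k^alpha.
        priorities = np.array(self.priorities, dtype=np.float64)
        probs = priorities ** self.alpha
        probs /= probs.sum()

        # Draw indices according to those probabilities.
        indices = np.random.choice(len(self.memory), batch_size, p=probs)
        transitions = [self.memory[i] for i in indices]

        # Importance weights: (N * P(i))^(-beta), normalized by the maximum
        # weight so the correction only ever scales the loss down.
        weights = (len(self.memory) * probs[indices]) ** (-self.beta)
        weights /= weights.max()
        weights = torch.tensor(weights, dtype=torch.float32)

        return transitions, indices, weights
```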
7. Implementing PER
Finally, two new methods are introduced: update_priorities, which sets the priorities of the sampled transitions to their absolute TD error plus epsilon, and increase_beta, which increments beta towards one over time.
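They might be sketched as follows, still inside the same class; the beta increment of 0.001 per call is an illustrative assumption:

```python
    def update_priorities(self, indices, td_errors):
        # Priority = |TD error| + epsilon, so no transition's sampling
        # probability ever drops to exactly zero.
        for idx, td_error in zip(indices, td_errors):
            self.priorities[idx] = abs(float(td_error)) + self.epsilon

    def increase_beta(self, increment=1e-3):
        # Anneal beta towards 1 so the bias correction strengthens over time.
        self.beta = min(1.0, self.beta + increment)
```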
8. PER in the DQN training loop
We now need to adjust the DQN training loop.
In the pre-loop code, we instantiate our replay buffer.
At the start of every episode, we increase the buffer's beta parameter using the increase_beta method.
At every step, we push the latest transition to the buffer and sample a batch of past transitions. After calculating the TD errors, we use them to update the priorities of the sampled transitions. Finally, the loss calculation needs to include the importance weights; a sketch of the full loop follows.
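Putting it together, a hedged sketch of the modified loop might look like this; env (a Gymnasium-style environment), select_action, compute_td_errors, optimizer, num_episodes, and batch_size are assumed to be defined elsewhere and are not the course's exact code:

```python
buffer = ReplayBuffer(capacity=10_000)      # pre-loop: instantiate the PER buffer

for episode in range(num_episodes):
    buffer.increase_beta()                  # strengthen the bias correction each episode
    state, _ = env.reset()
    done = False
    while not done:
        action = select_action(state)       # assumed helper, e.g. an epsilon-greedy policy
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push((state, action, reward, next_state, done))

        if len(buffer.memory) >= batch_size:
            transitions, indices, weights = buffer.sample(batch_size)
            # Assumed helper returning a differentiable tensor of per-sample TD errors.
            td_errors = compute_td_errors(transitions)
            # Refresh the priorities of the sampled transitions.
            buffer.update_priorities(indices, td_errors.detach())
            # Importance weights scale each sample's squared TD error in the loss.
            loss = (weights * td_errors.pow(2)).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        state = next_state
```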
9. PER In Action: Cartpole
Let's see PER in action.
We trained a DQN agent with and without PER, 100 times each, in the CartPole environment and averaged episode rewards. On average, the DQN agent converges faster with PER than with Uniform Experience Replay.
In more complex environments,
10. PER In Action: Atari environments
like Atari video games, PER can also significantly improve the agent's performance by the end of training, substantially raising average performance over DDQN, though the picture is less clear for median performance.
11. Let's practice!
Time to implement PER in the Lunar Lander environment.