DQN with experience replay

1. DQN with experience replay

Great work so far! Let's introduce experience replay to improve our Barebone DQN.

2. Introduction to experience replay

Our Barebone DQN agent used only its latest experience at each update. But consecutive experiences are highly correlated, which hurts learning: an agent learns better from diverse experiences. It also makes the agent forgetful. Imagine our agent navigating a maze. As it learns from new paths, it might forget crucial early junctions, leading to suboptimal decisions. Experience Replay solves this by storing the agent's experiences in a Replay Memory buffer. At each step, the agent learns from a random batch of past experiences.

3. The Double-Ended Queue

The double-ended queue, or deque, from Python's collections module, is ideal for implementing the Experience Replay memory buffer. We instantiate it by passing its initial contents as the first argument and limiting its capacity with the maxlen argument. We add items to the right using the append or extend methods, just like with a list. When the deque reaches capacity, we can still append: the oldest items are automatically dropped on the left. Essentially, we add new experiences as they occur and forget the oldest ones.
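For instance, here is a minimal illustration of that behavior with a capacity of 3 (the values are arbitrary):

```python
from collections import deque

# Initial contents as the first argument, capacity limited with maxlen
d = deque([1, 2], maxlen=3)
d.append(3)        # deque([1, 2, 3], maxlen=3): now at capacity
d.append(4)        # the oldest item (1) is dropped on the left
print(d)           # deque([2, 3, 4], maxlen=3)
```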

4. Implementing Replay Buffer

Let's implement a Replay Buffer in Python, using the random module for sampling. We define a class with a capacity parameter for the buffer size and initialize a memory attribute as an empty deque. The push method adds new experiences to the buffer. An experience is a transition tuple containing the state, action, reward, next state, and a done indicator. We append the transition to the deque, which automatically drops the oldest transition when at capacity. A len method is included so we can check how many transitions are stored.
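A minimal sketch of such a buffer, assuming the class is called ReplayBuffer and transitions are stored as plain tuples (the names are illustrative, not necessarily the exercise's exact code):

```python
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # The deque drops the oldest transition automatically once at capacity
        self.memory = deque([], maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition tuple
        self.memory.append((state, action, reward, next_state, done))

    def __len__(self):
        # Number of transitions currently stored
        return len(self.memory)
```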

5. Implementing Replay Buffer

The sample method randomly draws a batch of experiences, or transitions, from the replay memory using random.sample, without replacement. The batch, a list of transition tuples, is regrouped by the zip function into one sequence per field. We unpack these into states, actions, rewards, next_states, and dones, then convert them into a tuple of PyTorch tensors with the shape and type required for the loss calculation. The same conversion is applied to all of them except actions, which will later be used to select a specific index of each q_value tensor. For this, the actions tensor must have an integer dtype and be unsqueezed into the correct shape.
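Continuing the sketch above, the sample method could look along these lines; the exact dtypes and shapes are assumptions about what the loss calculation downstream expects:

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    # push and __len__ as in the sketch above

    def sample(self, batch_size):
        # Draw a random batch of transitions without replacement
        batch = random.sample(self.memory, batch_size)
        # zip(*batch) regroups the transition tuples field by field
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.tensor(states, dtype=torch.float32),
            # Actions get an integer dtype and shape (batch_size, 1)
            # so they can index each row of the Q-value matrix with gather
            torch.tensor(actions, dtype=torch.int64).unsqueeze(1),
            torch.tensor(rewards, dtype=torch.float32),
            torch.tensor(next_states, dtype=torch.float32),
            torch.tensor(dones, dtype=torch.float32),
        )
```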

6. Integrating Experience Replay in DQN

Let's integrate the Replay Buffer into our DQN training loop. At each step, the agent stores new experiences in the buffer using the push method. Training begins once the buffer exceeds the desired batch size. At each subsequent step, we sample a random batch from the buffer using the sample method. The DQN loss calculation now needs to handle batches of transitions. The steps are conceptually unchanged, but updated to work with tensors. The gather method selects the desired index for each row of a matrix. Here, we use it to select, for each transition, the q value corresponding to the selected action. The amax function takes the maximum of a tensor along a dimension. For each transition in the batch, we take the highest q value in the next state to calculate the TD target. We are still calculating the Squared Bellman Error, but now take its mean over a replay memory batch, resulting in more stable network updates and improved agent learning.
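A rough sketch of how these pieces might fit together; q_network, optimizer, gamma, batch_size, select_action, and the Gymnasium-style env are assumed from the earlier Barebone DQN setup, so treat the names and loop structure as illustrative rather than the course's exact code:

```python
import torch

buffer = ReplayBuffer(capacity=10_000)

for episode in range(num_episodes):
    state, info = env.reset()
    done = False
    while not done:
        action = select_action(q_network, state)  # e.g. epsilon-greedy (assumed helper)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # Store the new experience in the replay buffer
        buffer.push(state, action, reward, next_state, done)

        if len(buffer) >= batch_size:
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)
            # Q-value of the action actually taken, for each transition in the batch
            q_values = q_network(states).gather(1, actions).squeeze(1)
            with torch.no_grad():
                # Highest Q-value in each next state, used for the TD target
                next_q_values = q_network(next_states).amax(dim=1)
                # (1 - dones) zeroes the bootstrap term at episode ends
                td_targets = rewards + gamma * next_q_values * (1 - dones)
            # Mean Squared Bellman Error over the batch
            loss = torch.nn.functional.mse_loss(q_values, td_targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        state = next_state
```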

7. Let's practice!

Let's practice!