
Policy gradient and REINFORCE

1. Policy gradient and REINFORCE

Now that we have all the key notions in place, let us implement our first Policy Gradient algorithm: REINFORCE.

2. Differences with DQN

Here are a few key differences between REINFORCE and DQN. First, REINFORCE is a Monte-Carlo method, whereas DQN was based on Temporal Difference. Monte-Carlo means that we update the network at the end of the episode, once the entire trajectory and episode return have been observed, rather than at every step. Though we don't do it here, it is also common practice to update after several episodes instead; in that case we average the loss over multiple episodes before taking a gradient descent step. In REINFORCE, there is no value function, no target network, no epsilon-greediness and no experience replay; these are largely specific to value-based methods.

3. The REINFORCE training loop structure

Let us first review the REINFORCE training loop at a high level. As usual, we have an outer loop over episodes and an inner loop over steps. At each step, we select an action, play it, and observe the reward and the next state; the reward is folded into the episode return. Finally, when the episode is over, we calculate the loss and update the policy network by gradient descent.
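As a rough outline (comments only for now; a concrete PyTorch version follows with the training-loop slide), the structure looks like this:

```python
# Outline of the REINFORCE training loop (filled in concretely later):
# for each episode:
#     reset the environment, the episode return, and the stored log probabilities
#     for each step:
#         select an action from the policy network and note its log probability
#         play the action, observe the reward and the next state
#         add the (discounted) reward to the episode return
#     # episode over:
#     compute the loss from the episode return and the log probabilities
#     update the policy network with a gradient descent step
```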

4. Action Selection

To implement the select_action function in REINFORCE, we obtain the action probabilities from the network, use Categorical to build the action distribution, and sample one action. We use the log_prob method to obtain the log probability of the sampled action, and we return the action together with its log probability as a one-dimensional vector for later use. When we call select_action, the output might look like this: the action index as an integer, and the log probability as a negative real number.
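Here is a minimal sketch of this function, assuming a Gymnasium-style state (a NumPy array) and a policy network whose final layer is a softmax over actions; the name policy_network is an illustrative choice:

```python
import torch
from torch.distributions import Categorical

def select_action(policy_network, state):
    # Get the action probabilities from the policy network
    state = torch.tensor(state, dtype=torch.float32)
    action_probs = policy_network(state)
    # Build the action distribution and sample one action
    dist = Categorical(action_probs)
    action = dist.sample()
    # Log probability of the sampled action, as a one-dimensional vector
    log_prob = dist.log_prob(action).reshape(1)
    return action.item(), log_prob
```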

5. Loss Calculation

Recall the policy gradient theorem. We cannot directly calculate the expected value on the right-hand side, but we can calculate, for each episode, the term being averaged. REINFORCE therefore uses, as the loss for an episode, minus the episode return multiplied by the sum of the action log probabilities. In Python, once the episode is over, we can use the episode return and the action log probabilities to calculate this loss.
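As a sketch, using a hypothetical calculate_loss helper that takes the quantities described above:

```python
def calculate_loss(episode_log_probs, episode_return):
    # REINFORCE loss for one episode:
    # minus the episode return, times the sum of the action log probabilities
    return -episode_return * episode_log_probs.sum()
```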

6. The REINFORCE training loop

This is the REINFORCE training loop, and a lot of it is familiar. At the start of each episode, we initialize the episode log probabilities to an empty tensor, episode_log_probs, and the episode return to 0. At each step, after selecting the action, we increment the episode return with the discounted reward and append the action log probability to the episode_log_probs tensor using torch.cat. When the episode is over, we calculate the loss and take a gradient descent step to update the policy network.
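Below is a sketch of such a loop, reusing the select_action function sketched earlier. The environment (CartPole-v1 from Gymnasium), the network architecture, the Adam optimizer, and the values of gamma, the learning rate, and the episode count are illustrative assumptions, not the course's exact setup:

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")

# Illustrative policy network: state in, action probabilities out
policy_network = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
    nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy_network.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    state, _ = env.reset()
    done = False
    step = 0
    episode_log_probs = torch.tensor([])  # log probs of the actions taken
    episode_return = 0.0                  # discounted episode return

    while not done:
        action, log_prob = select_action(policy_network, state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Increment the episode return with the discounted reward
        episode_return += (gamma ** step) * reward
        # Append this step's log probability to the episode tensor
        episode_log_probs = torch.cat((episode_log_probs, log_prob))
        state = next_state
        step += 1

    # Episode over: calculate the loss and take a gradient descent step
    loss = -episode_return * episode_log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```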

7. Let's practice!

Let's practice!
