
Introduction to policy gradient

1. Introduction to policy gradient

Great work! Next, we focus on policy gradient methods, a powerful alternative to the value-based methods we've discussed.

2. Introduction to Policy methods in DRL

In Deep Q-Learning, we learned the action value function Q. The agent's policy was to take the action that maximizes the Q-value, or, under an epsilon-greedy strategy, a random action with small probability. We can also learn the agent's policy directly. Just as we used a Q-network to learn the value function, we can use a policy network to learn the policy function; we only need to find the right loss function.

3. Policy learning

Learning the policy rather than the value function has some key advantages. The policy can be stochastic, handle continuous action spaces, and directly optimize for the objective we care about. However, policy methods can have high variance and be less sample efficient. Understanding where they excel is key. In Deep Q-learning, policies are deterministic. Now, our policy pi_theta(a_t|s_t) is stochastic, describing the probability of taking action a_t in state s_t, given policy parameters theta. In the context of DRL, theta represents the neural network parameters for the policy.

4. The policy network (discrete actions)

For simplicity, let's stick to discrete actions. The network takes in the state s and outputs a probability for each action. This is achieved with a softmax output layer. In this example, the action 'down', with index 2, is the most likely to be chosen, with probability 0.74. To conveniently sample an action from the policy, we can first feed the action probabilities to the torch.distributions.Categorical class to represent the distribution, and then use the .sample() method to obtain an action.
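A minimal sketch of such a policy network in PyTorch, assuming a small fully connected architecture with an 8-dimensional state and 4 discrete actions (the layer sizes and names are illustrative, not taken from the slide):

import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        # Hidden layer, then a softmax turning logits into action probabilities
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

# Illustrative usage
policy = PolicyNetwork(state_dim=8, n_actions=4)
state = torch.rand(8)
action_probs = policy(state)       # e.g. tensor([0.10, 0.12, 0.74, 0.04])
dist = Categorical(action_probs)   # represent the policy distribution
action = dist.sample()             # sample an action index
log_prob = dist.log_prob(action)   # needed later for the policy gradient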

5. The objective function

The core idea in policy methods is to maximize expected returns, assuming the agent follows pi theta, by optimizing the policy parameters. The objective function, J(pi theta), is the expected return under the policy. To apply maximization techniques and learn the parameters theta that maximize J, we need the gradient of J with respect to theta,

6. The objective function

known as the policy gradient. Computing this gradient is the fundamental challenge - that's where the policy gradient theorem comes in!
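In symbols (a standard formulation; R(tau) denotes the return of a trajectory tau generated by following the policy, and alpha is a learning rate, neither of which is spelled out on the slide):

J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \big[ R(\tau) \big],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\pi_\theta)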

7. The policy gradient theorem

We will not prove the theorem here, but it is important to understand its meaning and use. The theorem gives us a tractable expression for the gradient of J, which we can use to train our policy network. It is an expectation over trajectories following the policy pi theta; in practice this means that we need to collect data by running the environment using the policy and observing the episode returns.

8. The policy gradient theorem

For each trajectory tau, we take the return R_tau accumulated over the trajectory,

9. The policy gradient theorem

and multiply it by the sum of the gradients of the log probabilities of playing each action over the trajectory. This may seem complicated but remember that a deep theoretical understanding is not necessary for this course. The intuition behind the theorem is that if we want our objective J to grow, we need to nudge theta in a way which pushes up the probability of actions that were taken in 'good' episodes with high return.
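Putting the last two slides together, the theorem reads as follows (this is the standard statement; T denotes the number of steps in the trajectory):

\nabla_\theta J(\pi_\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}
\Big[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]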

10. Interpreting the theorem

For example, if we see that an episode in which a Pong agent caught the ball brought a high return, we increase the probability of the actions that were taken in that episode.
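As a sketch of how this intuition turns into a training step, assuming the PolicyNetwork from earlier and a single collected episode, the expectation can be estimated from the episode and written as a loss that PyTorch can minimize. This is the classic REINFORCE-style update; the names reinforce_update and episode_return are illustrative:

import torch
import torch.optim as optim

# Assumes 'policy' is the PolicyNetwork sketched earlier
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(log_probs, episode_return):
    # log_probs: list of per-step log pi_theta(a_t|s_t) tensors saved while acting
    # episode_return: the return R(tau) of the episode (a float)
    log_prob_sum = torch.stack(log_probs).sum()
    # Minimizing -J nudges theta toward actions taken in high-return episodes
    loss = -episode_return * log_prob_sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()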

11. Let's practice!

Let's practice!