
Advantage Actor Critic

1. Advantage Actor Critic

Welcome back!

2. Why actor critic?

REINFORCE is a good introduction to policy-gradient methods, but it has a few limitations. First, it has high variance: in the simplest case, we use a single trajectory to estimate the policy gradient, which makes training very unstable. Second, as a Monte Carlo method, REINFORCE only learns at the end of each episode, whereas temporal difference methods learn at every step, allowing for more efficient learning. Actor Critic methods address these issues with a Critic network, unlocking temporal difference learning. We will review the Advantage Actor Critic algorithm, or A2C, which uses the TD error as the advantage.

3. The intuition behind Actor Critic methods

Imagine a student preparing for an exam. She could study alone and only get feedback once the exam grades are in, or she could join a study group and regularly have peers test her knowledge. The student is like the actor network, deciding what to study and answering practice questions, but poorly placed to evaluate her own progress. The study group is like the critic network, providing regular feedback to help the student learn better and faster. More formally, the Critic network is a value network. Its role is to evaluate the value function at every step to judge the quality of the latest action and provide this feedback to the Actor.

4. The Critic network

Like the Q-networks we encountered in Deep Q-Learning, the critic network is a value function approximator; however, it approximates the state-value function V rather than the action-value function Q. The Critic's role is to judge the Actor's latest action a_t using its advantage, or TD error. The architecture of the Critic network is similar to that of Q-networks, but it has a single output node for the state value.
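As an illustration, a Critic network in PyTorch could be sketched as follows; the layer sizes and the state dimension are assumptions made for the sketch, not the course's exact architecture.

import torch.nn as nn

class Critic(nn.Module):
    """Sketch of a Critic: maps a state to a single state-value estimate V(s)."""
    def __init__(self, state_dim, hidden_dim=64):  # dimensions are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single output node: the state value
        )

    def forward(self, state):
        return self.net(state)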

5. The Actor Critic dynamics

Let's explore the dynamics of Actor-Critic algorithms. The Actor network resembles the policy network in REINFORCE. It represents the agent's stochastic policy, sampling an action at each step.
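As a sketch, the Actor network could mirror a REINFORCE-style policy network with a softmax output over actions; the sizes below are illustrative assumptions.

import torch.nn as nn

class Actor(nn.Module):
    """Sketch of an Actor: maps a state to action probabilities (a stochastic policy)."""
    def __init__(self, state_dim, n_actions, hidden_dim=64):  # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
            nn.Softmax(dim=-1),  # probabilities over the discrete actions
        )

    def forward(self, state):
        return self.net(state)

# Sampling an action at each step (illustration):
# dist = torch.distributions.Categorical(actor(state))
# action = dist.sample()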

6. The Actor Critic dynamics

The critic network observes the resulting reward and updated state from the environment.

7. The Actor Critic dynamics

The critic evaluates the TD error, which is crucial for the loss calculations.
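Concretely, the TD error compares the reward plus the discounted value of the new state with the value of the current state, where gamma is the discount factor:

delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)

A positive delta_t means the action turned out better than the critic expected.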

8. The Actor Critic dynamics

The updated Actor network also observes the new state.

9. The Actor Critic dynamics

The process starts over, with the actor selecting its action for the next step.

10. The A2C losses

We need both the critic and actor networks to learn over time: the critic needs to approximate the state value better, and the actor needs to improve its policy. The critic loss is the squared TD error. The TD error is the critic's rating of the action: actions with a good outcome have a positive TD error. The actor loss is the negative of the action's log probability times the TD error. This means actions leading to good outcomes (positive TD error) become more likely.
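Written out, with delta_t the TD error and pi(a_t | s_t) the probability the actor assigned to the chosen action, the two losses are:

critic loss: delta_t^2
actor loss: -log pi(a_t | s_t) * delta_t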

11. Calculating the losses

Let's calculate the losses. First, we calculate the current and next state values, the TD target, and the TD error. We then calculate the Actor network loss. The formula includes the TD error, which depends on the Critic network, so we use PyTorch's .detach() method to prevent the actor's gradient descent from propagating to the critic's weights. The critic loss is the squared TD error.
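A minimal PyTorch sketch of this calculation might look as follows; the variable names (critic, state, action_log_prob, and so on) are assumptions standing in for whatever the surrounding code defines.

# Current and next state values from the Critic
value = critic(state)
next_value = critic(next_state)

# TD target and TD error (zero out the next value if the episode has ended)
td_target = reward + gamma * next_value * (1 - done)
td_error = td_target - value

# Actor loss: .detach() stops the actor's gradients from reaching the critic's weights
actor_loss = -action_log_prob * td_error.detach()

# Critic loss: the squared TD error
critic_loss = td_error ** 2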

12. The Actor Critic training loop

On to the training loop. The actor selects the action; the new state and reward are observed. The losses are calculated and the weights updated for both networks.
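Putting it together, the loop could be sketched along these lines, assuming a Gymnasium-style environment and that the networks, optimizers, gamma, and n_episodes have been set up beforehand.

import torch

for episode in range(n_episodes):
    state, info = env.reset()
    done = False
    while not done:
        state_t = torch.tensor(state, dtype=torch.float32)

        # Actor selects the action
        dist = torch.distributions.Categorical(actor(state_t))
        action = dist.sample()

        # New state and reward are observed
        next_state, reward, terminated, truncated, info = env.step(action.item())
        done = terminated or truncated

        # Losses for both networks (as on the previous slide)
        value = critic(state_t)
        next_value = critic(torch.tensor(next_state, dtype=torch.float32))
        td_error = reward + gamma * next_value * (1 - done) - value
        actor_loss = -dist.log_prob(action) * td_error.detach()
        critic_loss = td_error ** 2

        # Update the weights of both networks
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        state = next_state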

13. Let's practice!

Let's practice!
