The barebone DQN algorithm
1. The barebone DQN algorithm
Welcome back. Today, we will cover our first DRL algorithm.

2. The Barebone DQN
Let's call it the Barebone DQN algorithm. The full DQN algorithm, as published by DeepMind in 2015, features a number of clever tricks. We will introduce them progressively over the next videos. The Barebone DQN is what we get when we put together all the components we have introduced so far in this course: the generic DRL training loop, a Q-network, and the core principles of Q-learning. It is technically trainable, but do not expect it to perform very well. Think of it instead as a stepping stone towards more useful algorithms. Here is the generic training loop introduced previously. Let's now zoom into the action selection and the loss calculation, in this order.
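The loop itself is shown on the slide rather than in this transcript. As a rough guide, a minimal sketch of such a loop might look like the code below; the CartPole environment, the small fully connected Q-network, the Adam optimizer, and the number of episodes are assumptions for illustration, not the course's exact setup, and the sketch relies on the select_action and calculate_loss functions written in the next sections.

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Assumed setup: environment, Q-network, and optimizer are illustrative
# placeholders, not necessarily the objects used in the course.
env = gym.make("CartPole-v1")
q_network = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)

for episode in range(10):
    state, info = env.reset()
    done = False
    while not done:
        # Choose an action, take one environment step, then update the network.
        action = select_action(q_network, state)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        loss = calculate_loss(q_network, state, action, next_state, reward, done)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state
```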
3. The Barebone DQN action selection

First, let's write the select_action function. The forward pass feeds the state as input to the network and returns the Q-value associated with each action. The agent's policy is to select, at each step, the action with the highest Q-value from the Q-network. In this example, the maximum Q-value is 0.12, which corresponds to the action with index 2. The argmax of a vector is the index at which its maximum value can be found. The torch.argmax function returns a tensor with one element; using the item method, we can extract that element as an integer.
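Putting these steps together, a minimal version of select_action might look like the sketch below; the exact signature used in the course may differ.

```python
import torch

def select_action(q_network, state):
    # Convert the state to a tensor and run a forward pass
    # to obtain one Q-value per action.
    state_tensor = torch.tensor(state, dtype=torch.float32)
    q_values = q_network(state_tensor)
    # Greedy policy: take the index of the highest Q-value
    # and extract it as a plain Python integer.
    return torch.argmax(q_values).item()
```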
4. The Barebone DQN loss function

Now consider the calculate_loss function. Recall that the action-value function Q satisfies the Bellman equation, which forms the basis of the Q-learning updates. In DQN, we want to minimize the difference between the two sides of the Bellman equation, also known as the TD-error. The term Bellman error normally refers to the expected TD-error but, assuming a deterministic environment and policy for simplicity, the two coincide. Taking the squared Bellman error, which penalizes large deviations, is a popular choice for the DQN loss function. Here, theta denotes the parameters of the Q-network. Do not worry if these equations appear intimidating. The primary focus here is on code implementation, and a deep understanding of the underlying theory is not required to complete the course.
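The slide equations are not reproduced in this transcript; in standard notation, with s, a, r, and s' denoting the state, action, reward, and next state, the squared Bellman error loss described here is usually written along the lines of:

L(theta) = ( r + gamma * max over a' of Q(s', a'; theta) - Q(s, a; theta) )^2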
5. The Barebone DQN loss function

Let's transition from this mathematical formulation to our Python implementation. First, we get the Q-values by passing the state to the q_network, storing the result in q_values. This tensor contains a Q-value for each possible action. To get the Q-value of the action taken in the current state, we index q_values with the action index. Next, we determine the maximum Q-value for the next state by passing next_state to q_network and applying the .max method. The target Q-value includes the reward obtained from the gymnasium environment. It requires a hyperparameter gamma for the discount rate (for example 0.99), and the next-state Q-value from the previous step. When the episode is complete, the next state's Q-value should be zero, which we achieve by multiplying it by (1 - done). Finally, the squared Bellman error is calculated by applying PyTorch's MSELoss function to the two sides of the Bellman equation.
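Assembled into code, a sketch of calculate_loss could look like the following; the argument order, the module-level gamma, and the value 0.99 are assumptions rather than the course's exact code.

```python
import torch
import torch.nn as nn

gamma = 0.99  # assumed discount rate
mse_loss = nn.MSELoss()

def calculate_loss(q_network, state, action, next_state, reward, done):
    state_tensor = torch.tensor(state, dtype=torch.float32)
    next_state_tensor = torch.tensor(next_state, dtype=torch.float32)
    # Q-value of the action actually taken in the current state.
    q_values = q_network(state_tensor)
    current_q = q_values[action]
    # Maximum Q-value over actions in the next state,
    # zeroed out when the episode has ended.
    next_q = q_network(next_state_tensor).max()
    target_q = reward + gamma * next_q * (1 - done)
    # Squared Bellman error between the two sides of the Bellman equation.
    # (A more careful implementation would typically detach target_q
    # from the computation graph.)
    return mse_loss(current_q, target_q)
```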
6. Describing the episodes

In the upcoming exercises, you will be using a function called describe_episode() to help you observe how your agent is doing. For each episode, it will output the number of steps and the return.
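The helper is provided for you in the exercises, so you will not need to write it. Its actual signature is not shown here; a hypothetical implementation consistent with this description might be:

```python
def describe_episode(episode, episode_return, num_steps):
    # Hypothetical helper: report the episode index, its length in steps,
    # and the total return collected during the episode.
    print(f"Episode {episode}: {num_steps} steps, return {episode_return}")
```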
7. Let's practice!

Let's practice!