Implementing the Q-learning update rule
Q-learning is an off-policy reinforcement learning (RL) algorithm that seeks the best action to take in the current state. Unlike SARSA, which updates its Q-values using the action the agent actually takes next, Q-learning updates them using the maximum estimated value of the next state, regardless of which action is taken. This distinction allows Q-learning to learn the optimal policy while following an exploratory or even a random behavior policy. The Q-learning update rule is shown below; your task is to implement a function that updates a Q-table based on this rule.
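In standard notation, with learning rate alpha and discount factor gamma, the update applied after taking action a in state s, receiving reward r, and landing in state s' is:

Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))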
The NumPy library has been imported for you as np.
This exercise is part of the course Reinforcement Learning with Gymnasium in Python.
Exercise instructions
- Retrieve the current Q-value for the given state-action pair.
- Determine the maximum Q-value for the next state across all possible actions in actions.
- Update the Q-value for the current state-action pair using the Q-learning formula.
- Update the Q-table Q, given that an agent takes action 0 in state 0, receives a reward of 5, and moves to state 1.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
actions = ['action1', 'action2']

def update_q_table(state, action, reward, next_state):
    # Get the old value of the current state-action pair
    old_value = ____
    # Determine the maximum Q-value for the next state
    next_max = ____
    # Compute the new value of the current state-action pair
    Q[state, action] = ____

alpha = 0.1
gamma = 0.95
Q = np.array([[10, 8], [20, 15]], dtype='float32')

# Update the Q-table
____

print(Q)
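For reference, below is one way the blanks could be filled in. It is a sketch of the standard Q-learning update, assuming np is NumPy and that Q, alpha, and gamma are the globals defined above; it is not the only valid solution.

import numpy as np

alpha = 0.1   # learning rate
gamma = 0.95  # discount factor
Q = np.array([[10, 8], [20, 15]], dtype='float32')
actions = ['action1', 'action2']  # two available actions, indexed 0 and 1

def update_q_table(state, action, reward, next_state):
    # Get the old value of the current state-action pair
    old_value = Q[state, action]
    # Determine the maximum Q-value for the next state across all actions
    next_max = np.max(Q[next_state])
    # Compute the new value of the current state-action pair
    Q[state, action] = old_value + alpha * (reward + gamma * next_max - old_value)

# Update the Q-table: action 0 in state 0, reward of 5, next state 1
update_q_table(0, 0, 5, 1)
print(Q)

With these values, Q[0, 0] moves from 10 to 10 + 0.1 * (5 + 0.95 * 20 - 10) = 11.4, so the printed table should show roughly 11.4 in that cell while the other entries are unchanged.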