Solving a multi-armed bandit
This exercise involves implementing an epsilon-greedy strategy to solve a 10-armed bandit problem, where the epsilon value decays over time to shift from exploration to exploitation.
epsilon, min_epsilon, and epsilon_decay have been pre-defined for you. The epsilon_greedy() function has been imported as well.
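For intuition, here is a minimal sketch of what an epsilon-greedy selection function might look like. The imported epsilon_greedy() may use a different signature and internals, so treat this as an illustration under assumptions, not the course implementation:

import numpy as np

def epsilon_greedy(epsilon, values):
    # Explore with probability epsilon: pick a uniformly random arm
    if np.random.rand() < epsilon:
        return np.random.randint(len(values))
    # Otherwise exploit: pick the arm with the highest estimated value
    return int(np.argmax(values))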
This exercise is part of the course
Reinforcement Learning with Gymnasium in Python
Instructions
- Use the create_multi_armed_bandit() function to initialize a 10-armed bandit problem, which will return true_bandit_probs, counts, values, rewards, and selected_arms.
- Select an arm to pull using the epsilon_greedy() function.
- Simulate the reward based on the true bandit probabilities.
- Decay the epsilon value, ensuring that it does not fall below the min_epsilon value (one possible pattern is sketched after this list).
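For the decay step, one common pattern, assuming epsilon_decay is a multiplicative factor, is to scale epsilon and clamp it at the floor:

# Sketch: multiplicative decay, clamped so epsilon never drops below min_epsilon
epsilon = max(min_epsilon, epsilon * epsilon_decay)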
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Create a 10-armed bandit
true_bandit_probs, counts, values, rewards, selected_arms = ____
for i in range(n_iterations):
    # Select an arm
    arm = ____
    # Compute the received reward
    reward = ____
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
    # Update epsilon
    epsilon = ____
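For reference, a possible completed version of the loop is sketched below, under stated assumptions: create_multi_armed_bandit() takes the number of arms, epsilon_greedy() takes the current epsilon and the value estimates, rewards are Bernoulli draws from the true arm probabilities, and n_iterations is pre-defined. The exercise environment may define these names differently.

import numpy as np

# Assumed signature: returns the five arrays listed in the instructions
true_bandit_probs, counts, values, rewards, selected_arms = create_multi_armed_bandit(10)

for i in range(n_iterations):
    # Select an arm (assumed signature: epsilon_greedy(epsilon, values))
    arm = epsilon_greedy(epsilon, values)
    # Bernoulli reward: 1 with the arm's true probability, 0 otherwise
    reward = int(np.random.rand() < true_bandit_probs[arm])
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
    # Incremental mean: running average of rewards observed for this arm
    values[arm] += (reward - values[arm]) / counts[arm]
    # Multiplicative decay, clamped at min_epsilon
    epsilon = max(min_epsilon, epsilon * epsilon_decay)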