Creating a multi-armed bandit
A multi-armed bandit problem is a classic example used in reinforcement learning to describe a scenario where an agent must choose between multiple actions (or "arms") without knowing the expected reward of each. Over time, the agent learns which arm yields the highest reward by exploring each option. This exercise involves setting up the foundational structure for simulating a multi-armed bandit problem.
The numpy library has been imported as np.
This exercise is part of the course
Reinforcement Learning with Gymnasium in Python
Instructions
- Generate an array true_bandit_probs with random probabilities representing the true underlying success rate for each bandit.
- Initialize two arrays, counts and values, with zeros; counts tracks the number of times each bandit has been chosen, and values represents the estimated winning probability of each bandit.
- Create rewards and selected_arms arrays to store the rewards obtained and the arms selected in each iteration.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
def create_multi_armed_bandit(n_bandits):
    # Generate the true bandit probabilities
    true_bandit_probs = ____
    # Create arrays that store the count and value for each bandit
    counts = ____
    values = ____
    # Create arrays that store the rewards and selected arms each episode
    rewards = ____
    selected_arms = ____
    return true_bandit_probs, counts, values, rewards, selected_arms
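One possible completion is sketched below. Note that the original function signature takes only n_bandits; the rewards and selected_arms arrays also need to know how many iterations the simulation will run, so this sketch adds an n_iterations parameter as an assumption (in the exercise environment that value may be provided some other way):

```python
import numpy as np

def create_multi_armed_bandit(n_bandits, n_iterations):
    # True success probability of each arm, drawn uniformly from [0, 1)
    true_bandit_probs = np.random.rand(n_bandits)
    # Number of times each arm has been pulled so far
    counts = np.zeros(n_bandits)
    # Running estimate of each arm's winning probability
    values = np.zeros(n_bandits)
    # Reward received and arm chosen at each iteration
    rewards = np.zeros(n_iterations)
    selected_arms = np.zeros(n_iterations, dtype=int)
    return true_bandit_probs, counts, values, rewards, selected_arms

probs, counts, values, rewards, arms = create_multi_armed_bandit(5, 100)
```

All five arrays start at zero except true_bandit_probs; the counts and values arrays are updated as the agent pulls arms, while rewards and selected_arms record the full history of the run.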