
Creating a multi-armed bandit

A multi-armed bandit problem is a classic example used in reinforcement learning to describe a scenario where an agent must choose between multiple actions (or "arms") without knowing the expected reward of each. Over time, the agent learns which arm yields the highest reward by exploring each option. This exercise involves setting up the foundational structure for simulating a multi-armed bandit problem.
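As an illustration of what "learning which arm yields the highest reward" looks like in code, the short sketch below simulates a single pull of one arm and updates a running estimate of its success rate. The array names and the incremental-mean update are assumptions for illustration only; they are not part of this exercise's required solution.

import numpy as np

# Hypothetical setup: 3 bandits with hidden success probabilities
true_bandit_probs = np.array([0.2, 0.5, 0.8])
counts = np.zeros(3)   # number of pulls per arm so far
values = np.zeros(3)   # estimated success rate per arm

arm = 1  # pick an arm (chosen arbitrarily here; a real agent would balance exploration and exploitation)
reward = np.random.rand() < true_bandit_probs[arm]  # Bernoulli reward: 1 with probability 0.5
counts[arm] += 1
# Incremental mean: nudge the estimate toward the observed reward
values[arm] += (reward - values[arm]) / counts[arm]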

The numpy library has been imported as np.

This exercise is part of the course Reinforcement Learning with Gymnasium in Python.

Instructions

  • Generate an array true_bandit_probs with random probabilities representing the true underlying success rate for each bandit.
  • Initialize two arrays, counts and values, with zeros; counts tracks the number of times each bandit has been chosen, and values represents the estimated winning probability of each bandit.
  • Create rewards and selected_arms arrays to store the rewards obtained and the arms selected in each iteration.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

def create_multi_armed_bandit(n_bandits):
    # Generate the true bandit probabilities
    true_bandit_probs = ____
    # Create arrays that store the count and value for each bandit
    counts = ____
    values = ____
    # Create arrays that store the rewards and selected arms each episode
    rewards = ____
    selected_arms = ____
    return true_bandit_probs, counts, values, rewards, selected_arms
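
One way the blanks might be filled in is sketched below. This is not the course's official solution, and the skeleton above does not define the number of episodes, so the sketch passes it in as an extra n_episodes parameter purely to stay self-contained and runnable.

import numpy as np

def create_multi_armed_bandit(n_bandits, n_episodes):
    # True (hidden) success probability of each bandit, drawn uniformly from [0, 1)
    true_bandit_probs = np.random.rand(n_bandits)
    # Number of times each bandit has been chosen so far
    counts = np.zeros(n_bandits)
    # Running estimate of each bandit's winning probability
    values = np.zeros(n_bandits)
    # Reward obtained and arm selected at each episode
    rewards = np.zeros(n_episodes)
    selected_arms = np.zeros(n_episodes, dtype=int)
    return true_bandit_probs, counts, values, rewards, selected_arms

# Example usage with hypothetical sizes: 4 bandits, 1000 episodes
true_bandit_probs, counts, values, rewards, selected_arms = create_multi_armed_bandit(4, 1000)
print(true_bandit_probs.round(2))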