
Creating a multi-armed bandit

A multi-armed bandit problem is a classic reinforcement learning scenario in which an agent must choose between multiple actions (or "arms") without knowing the expected reward of each. Over time, the agent learns which arm yields the highest reward by exploring the available options. This exercise sets up the foundational structure for simulating a multi-armed bandit problem.
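For instance, a single pull of one arm can be simulated as a Bernoulli draw against that arm's hidden success probability. This is a minimal sketch; the probabilities and variable names here are illustrative, not part of the exercise:

```python
import numpy as np

# Hidden success probabilities for 3 arms (unknown to the agent)
true_probs = np.array([0.2, 0.5, 0.8])

# Pulling arm 1 yields reward 1 with probability true_probs[1], else 0
arm = 1
reward = int(np.random.rand() < true_probs[arm])
```

The agent never sees `true_probs` directly; it can only observe the 0/1 rewards and estimate each arm's value from them.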

The numpy library has been imported as np.

This exercise is part of the course Reinforcement Learning with Gymnasium in Python.

Exercise instructions

  • Generate an array true_bandit_probs with random probabilities representing the true underlying success rate for each bandit.
  • Initialize two arrays, counts and values, with zeros; counts tracks the number of times each bandit has been chosen, and values represents the estimated winning probability of each bandit.
  • Create rewards and selected_arms arrays to store the rewards obtained and the arms selected in each iteration.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

def create_multi_armed_bandit(n_bandits):
    # Generate the true bandit probabilities
    true_bandit_probs = ____
    # Create arrays that store the count and value for each bandit
    counts = ____
    values = ____
    # Create arrays that store the rewards and selected arms each episode
    rewards = ____
    selected_arms = ____
    return true_bandit_probs, counts, values, rewards, selected_arms
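One possible completion is sketched below. The `n_iterations` parameter is an assumption made so the snippet is self-contained; the course environment may instead define the number of episodes as a global variable:

```python
import numpy as np

def create_multi_armed_bandit(n_bandits, n_iterations=1000):
    # n_iterations is an assumed parameter; the exercise may define it elsewhere
    # Generate the true underlying success probability of each bandit
    true_bandit_probs = np.random.rand(n_bandits)
    # Arrays tracking how often each bandit was chosen and its estimated value
    counts = np.zeros(n_bandits)
    values = np.zeros(n_bandits)
    # Arrays storing the reward obtained and the arm selected in each episode
    rewards = np.zeros(n_iterations)
    selected_arms = np.zeros(n_iterations, dtype=int)
    return true_bandit_probs, counts, values, rewards, selected_arms
```

With this setup, a learning loop would repeatedly pick an arm, record the reward in `rewards`, log the choice in `selected_arms`, and update `counts` and `values` for the chosen bandit.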