Solving a multi-armed bandit

This exercise involves implementing an epsilon-greedy strategy to solve a 10-armed bandit problem, in which the epsilon value decays over time to shift the agent from exploration toward exploitation.

epsilon, min_epsilon, and epsilon_decay have been pre-defined for you. The epsilon_greedy() function has been imported as well.
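
If you need a refresher, below is a minimal sketch of what an epsilon-greedy selection rule can look like. This is illustrative only: the imported epsilon_greedy() may use a different signature, and the values argument here is assumed to hold the running reward estimate for each arm.

import numpy as np

def epsilon_greedy(epsilon, values):
    # With probability epsilon, explore by picking a random arm;
    # otherwise exploit the arm with the highest estimated value.
    if np.random.rand() < epsilon:
        return np.random.randint(len(values))
    return int(np.argmax(values))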

This exercise is part of the course Reinforcement Learning with Gymnasium in Python.


Exercise instructions

  • Use the create_multi_armed_bandit() function to initialize a 10-armed bandit problem, which will return true_bandit_probs, counts, values, rewards, and selected_arms.
  • Select an arm to pull using the epsilon_greedy() function.
  • Simulate the reward based on the true bandit probabilities.
  • Decay the epsilon value, ensuring that it does not fall below min_epsilon.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a 10-armed bandit
true_bandit_probs, counts, values, rewards, selected_arms = ____

for i in range(n_iterations):
    # Select an arm
    arm = ____
    # Compute the received reward
    reward = ____
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
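    # Update the running average reward for this arm (incremental mean)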
    values[arm] += (reward - values[arm]) / counts[arm]
    # Update epsilon
    epsilon = ____
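For reference once you have attempted the exercise, here is one possible completion of the scaffold. It is a sketch under assumptions, not the course's official solution: it reuses the epsilon_greedy() signature sketched above, assumes create_multi_armed_bandit() takes the number of arms, and assumes n_iterations is pre-defined in the environment.

import numpy as np

# Create a 10-armed bandit (assuming the helper takes the number of arms)
true_bandit_probs, counts, values, rewards, selected_arms = create_multi_armed_bandit(10)

for i in range(n_iterations):
    # Select an arm with the epsilon-greedy rule
    arm = epsilon_greedy(epsilon, values)
    # Compute the received reward: a Bernoulli draw against the arm's true probability
    reward = int(np.random.rand() < true_bandit_probs[arm])
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
    # Update the running average reward for this arm (incremental mean)
    values[arm] += (reward - values[arm]) / counts[arm]
    # Decay epsilon multiplicatively, never letting it fall below min_epsilon
    epsilon = max(min_epsilon, epsilon * epsilon_decay)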