Creating a multi-armed bandit

A multi-armed bandit problem is a classic example used in reinforcement learning to describe a scenario where an agent must choose between multiple actions (or "arms") without knowing the expected reward of each. Over time, the agent estimates each arm's reward by pulling the arms and observing the outcomes, balancing exploration of uncertain arms against exploitation of the best arm found so far. This exercise involves setting up the foundational structure for simulating a multi-armed bandit problem.

The numpy library has been imported as np.

Generate an array true_bandit_probs with random probabilities representing the true underlying success rate for each bandit.
Initialize two arrays, counts and values, with zeros; counts tracks the number of times each bandit has been chosen, and values represents the estimated winning probability of each bandit.
Create rewards and selected_arms arrays to store the reward obtained and the arm selected in each iteration.
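
The three steps above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exercise's reference solution: the function name create_multi_armed_bandit, its parameters n_bandits and n_iterations, and the optional seed are assumptions introduced here for illustration, as is the choice of an integer dtype for selected_arms.

```python
import numpy as np

def create_multi_armed_bandit(n_bandits, n_iterations, seed=None):
    """Set up the arrays for a bandit simulation (hypothetical helper)."""
    # Assumption: a seeded Generator is used so runs can be reproduced.
    rng = np.random.default_rng(seed)

    # True success rate of each bandit, drawn uniformly from [0, 1).
    true_bandit_probs = rng.random(n_bandits)

    # counts: how many times each bandit has been chosen.
    counts = np.zeros(n_bandits)
    # values: estimated winning probability of each bandit.
    values = np.zeros(n_bandits)

    # Per-iteration records of the reward obtained and the arm selected.
    rewards = np.zeros(n_iterations)
    # Assumption: integer dtype, so entries can later index the other arrays.
    selected_arms = np.zeros(n_iterations, dtype=int)

    return true_bandit_probs, counts, values, rewards, selected_arms
```

For example, create_multi_armed_bandit(4, 1000, seed=42) would return five arrays: three of length 4 (one entry per bandit) and two of length 1000 (one entry per iteration).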
