
Multi-armed bandits

1. Multi-armed bandits

The final step of our journey will be about multi-armed bandits, a cornerstone concept in RL that simplifies the exploration-exploitation dilemma into a more tangible format. Specifically, we'll learn how to create, solve, and analyze a multi-armed bandit environment with a decayed epsilon-greedy strategy.

2. Multi-armed bandits

The multi-armed bandit problem is commonly described through the analogy of a gambler facing a row of slot machines, each with a different, unknown probability of winning. The challenge is to maximize winnings by deciding which machine to play, how many times to play it, and when to switch to another machine. This scenario perfectly encapsulates the exploration-exploitation trade-off: exploring to find the machine with the highest reward, while exploiting known information to maximize winnings.

3. Slot machines

To create a simulated multi-armed bandit environment, we start by assuming we have a set of slot machines, each with its own probability of winning. These probabilities are typically unknown to the agent and must be learned over time through interaction. We simulate each slot machine as a bandit arm, where pulling an arm results in a reward of +1 or 0, and the agent's goal is to accumulate as much reward as possible.
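As a minimal sketch of such an environment (the probabilities below are illustrative, and the pull_arm helper is an assumption rather than part of the course code), each arm can be simulated with a single random draw:

```python
import numpy as np

# Illustrative winning probabilities; in practice these are hidden from the agent
true_bandit_probs = np.array([0.3, 0.55, 0.7, 0.45])

def pull_arm(arm):
    """Simulate one pull: reward is 1 with the arm's true probability, 0 otherwise."""
    return int(np.random.rand() < true_bandit_probs[arm])
```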

4. Solving the problem

One strategy to solve the problem is the decayed epsilon-greedy approach, where with probability epsilon, we explore by selecting an arm at random,

5. Solving the problem

and with probability 1-epsilon, we exploit by choosing the arm with the highest average reward so far. Because epsilon keeps decreasing over time, the agent gradually shifts from exploration toward exploitation.
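A possible sketch of this selection rule (using the epsilon_greedy name referenced later, with ties in argmax broken by the lowest index) could look like this:

```python
def epsilon_greedy(epsilon, values):
    """With probability epsilon, explore a random arm; otherwise exploit the best estimate."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(values))  # explore: pick any arm uniformly at random
    return int(np.argmax(values))              # exploit: arm with the highest estimated value
```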

6. Initialization

To do that in code, we define the number of bandits, each with its own probability of winning, represented by the true_bandit_probs array. We also define the number of iterations, the initial epsilon, min_epsilon, and epsilon_decay. We then initialize four arrays: counts, which records the number of times the agent engaged with each bandit; values, which holds the expected winning probability of each bandit based on the rewards we receive; rewards, which logs the reward obtained at each iteration; and selected_arms, which notes which bandit was chosen at every step.
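Continuing the sketch from above, the setup might look as follows (the specific iteration count and epsilon schedule are illustrative assumptions, not the course's exact values):

```python
n_bandits = len(true_bandit_probs)
n_iterations = 5000

epsilon = 1.0          # initial exploration rate
min_epsilon = 0.01     # floor so the agent never stops exploring entirely
epsilon_decay = 0.999  # decay factor applied at every iteration

counts = np.zeros(n_bandits)                        # pulls per bandit
values = np.zeros(n_bandits)                        # estimated winning probability per bandit
rewards = np.zeros(n_iterations)                    # reward obtained at each iteration
selected_arms = np.zeros(n_iterations, dtype=int)   # which bandit was chosen at every step
```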

7. Interaction loop

Then, the agent interacts with the machines. At every iteration, the agent pulls an arm using the epsilon_greedy strategy and may or may not receive a reward, depending on the bandit's true probability. We then update the rewards, selected_arms, and counts arrays. Next, we update our estimate of the bandit's value using an incremental formula that adjusts the current estimate towards the latest observed reward, so estimates become more accurate over time without needing to store or recompute the entire history of rewards. Finally, we update epsilon.
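Assuming the helpers and arrays sketched above, the interaction loop could be written like this; the line values[arm] += (reward - values[arm]) / counts[arm] is the incremental update that nudges the estimate toward the latest reward:

```python
for i in range(n_iterations):
    arm = epsilon_greedy(epsilon, values)   # choose an arm
    reward = pull_arm(arm)                  # +1 or 0, depending on the true probability

    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1

    # Incremental mean: shift the current estimate toward the observed reward
    values[arm] += (reward - values[arm]) / counts[arm]

    # Decay epsilon, but never below the minimum (multiplicative decay assumed here)
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
```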

8. Analyzing selections

Now, to analyze how this strategy has performed across the iterations, we initialize a matrix called selections_percentage with dimensions matching the number of iterations and the number of bandits.

9. Analyzing selections

For every iteration, we mark the selected arm in this matrix.

10. Analyzing selections

By dividing the cumulative sum of selections for each bandit by the total number of iterations up to that point, we normalize the data to reflect the percentage of times each bandit has been selected as the episodes progress.
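One way to compute these running selection percentages, assuming the arrays from the sketches above, is:

```python
# One-hot matrix of selections: rows = iterations, columns = bandits
selections_percentage = np.zeros((n_iterations, n_bandits))
selections_percentage[np.arange(n_iterations), selected_arms] = 1

# Running fraction of pulls per bandit: cumulative selections / iterations elapsed so far
selections_percentage = (np.cumsum(selections_percentage, axis=0)
                         / np.arange(1, n_iterations + 1)[:, None])
```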

11. Analyzing selections

Plotting the selection percentage curve for each arm, we observe that the selection pattern of bandits varies across episodes, initially appearing somewhat random and subsequently gravitating more towards bandit number 2. This shift demonstrates the adaptive nature of our strategy. Comparing with the true probabilities of the bandits, it becomes evident that the agent has successfully learned to select the bandit with the highest probability through the decayed epsilon-greedy strategy.
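A plot along these lines could be produced with matplotlib, for example:

```python
import matplotlib.pyplot as plt

for arm in range(n_bandits):
    plt.plot(selections_percentage[:, arm],
             label=f"Bandit {arm + 1} (true p = {true_bandit_probs[arm]:.2f})")
plt.xlabel("Iteration")
plt.ylabel("Selection percentage")
plt.legend()
plt.show()
```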

12. Let's practice!

Now, it's time to practice!
