Balancing exploration and exploitation

1. Balancing exploration and exploitation

Now that we're familiar with key RL algorithms, it is time to dive into balancing exploration and exploitation, a crucial aspect of building an efficient learning strategy.

2. Training with random actions

Until now, when training with Temporal Difference methods, the agent's actions have been chosen randomly. This lets the agent explore the environment, which is essential in the early stages of learning. However, it also prevents the agent from optimizing its strategy based on what it has already learned; in other words, the agent does not use its knowledge until training concludes.

3. Exploration-exploitation trade-off

The exploration-exploitation trade-off tackles this issue by balancing between exploring new actions to gain information and exploiting acquired knowledge to maximize rewards. Continuous exploration prevents strategy refinement, while exclusive exploitation risks missing undiscovered opportunities.

4. Dining choices

This trade-off is similar to choosing where to eat, where exploration is like trying a new restaurant, while exploitation is like choosing to eat at a favorite one. Both approaches are crucial, but finding the right balance is key to developing an efficient learning strategy for the agent. To do that, strategies such as epsilon-greedy and decayed epsilon-greedy are employed.

5. Epsilon-greedy strategy

The epsilon-greedy strategy involves choosing to explore, or selecting an action at random, with a probability of epsilon,

6. Epsilon-greedy strategy

and choosing to exploit, or select the best-known action, with a probability of 1 - epsilon. This approach ensures that the agent continues to explore new actions with a certain likelihood while primarily exploiting its current knowledge to maximize rewards.

7. Decayed epsilon-greedy strategy

The decayed epsilon-greedy strategy refines this approach by gradually reducing the value of epsilon over time. Initially, a higher epsilon value is used to encourage more exploration, allowing the agent to gather information about the environment. As the agent learns more about its environment, epsilon is decreased according to a predefined decay schedule, shifting the balance towards exploitation. This decay ensures that the agent increasingly relies on its accumulated knowledge to make decisions, optimizing its strategy as it becomes more familiar with the environment.
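In code, a common way to express this schedule is a multiplicative decay applied at the end of each episode; a minimal sketch, with illustrative values chosen here as assumptions:

```python
epsilon = 1.0          # start with mostly exploration
epsilon_decay = 0.99   # shrink epsilon a little every episode (illustrative value)
min_epsilon = 0.05     # keep a small amount of exploration forever

# Applied once at the end of each episode:
epsilon = max(min_epsilon, epsilon * epsilon_decay)
```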

8. Implementation with Frozen Lake

Let's put this into practice with Q-learning in the Frozen Lake environment. First, we define a Q-table as a numpy array with dimensions reflecting the state_size and action_size. Then, we define the learning rate, discount factor, and total_episodes.
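A minimal sketch of this setup, assuming the Gymnasium FrozenLake-v1 environment and illustrative hyperparameter values:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)

# Q-table: one row per state, one column per action
state_size = env.observation_space.n
action_size = env.action_space.n
q_table = np.zeros((state_size, action_size))

# Hyperparameters (values are assumptions for illustration)
learning_rate = 0.1
discount_factor = 0.99
total_episodes = 1000
```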

9. Implementing epsilon_greedy()

We define a function epsilon_greedy that receives a state as input, and compares a randomly generated number with epsilon. If the random number is below epsilon, the agent explores the environment by selecting a random action. Otherwise, the agent exploits its knowledge by choosing the action with the highest Q-value for the current state.
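A sketch of such a function, assuming the q_table and env defined above and a global epsilon:

```python
def epsilon_greedy(state):
    if np.random.rand() < epsilon:
        # Explore: select a random action
        return env.action_space.sample()
    # Exploit: select the action with the highest Q-value for this state
    return np.argmax(q_table[state, :])
```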

10. Training epsilon-greedy

To train the agent using the epsilon-greedy approach, we first define the exploration rate, epsilon, 0.9 in our case. Then, the training loop is just as before: the agent selects an action, executes it, and updates the q_table; the only difference is that the action is now chosen by the epsilon_greedy() function. We also track the rewards collected in each episode to analyze the training process.
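A sketch of this loop, assuming the standard Q-learning update and the objects defined above:

```python
epsilon = 0.9            # exploration rate
rewards_eps_greedy = []  # reward collected per episode

for episode in range(total_episodes):
    state, info = env.reset()
    terminated = truncated = False
    episode_reward = 0
    while not (terminated or truncated):
        action = epsilon_greedy(state)
        next_state, reward, terminated, truncated, info = env.step(action)
        # Q-learning update of the current state-action value
        q_table[state, action] = (1 - learning_rate) * q_table[state, action] + \
            learning_rate * (reward + discount_factor * np.max(q_table[next_state, :]))
        state = next_state
        episode_reward += reward
    rewards_eps_greedy.append(episode_reward)
```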

11. Training decayed epsilon-greedy

Now, to train with the decayed strategy, we define an initial epsilon, an epsilon_decay factor by which epsilon is multiplied at the end of each episode, and a min_epsilon below which epsilon never falls. Then, during training, we track the rewards for further analysis; nothing else changes except that we multiply epsilon by epsilon_decay after each episode until it reaches min_epsilon.
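A sketch of the decayed version, reusing the same loop; the decay values and the Q-table reset are assumptions made here so the two strategies start from scratch:

```python
q_table = np.zeros((state_size, action_size))  # retrain from scratch for a fair comparison

epsilon = 1.0
epsilon_decay = 0.999
min_epsilon = 0.01
rewards_decay = []

for episode in range(total_episodes):
    state, info = env.reset()
    terminated = truncated = False
    episode_reward = 0
    while not (terminated or truncated):
        action = epsilon_greedy(state)
        next_state, reward, terminated, truncated, info = env.step(action)
        q_table[state, action] = (1 - learning_rate) * q_table[state, action] + \
            learning_rate * (reward + discount_factor * np.max(q_table[next_state, :]))
        state = next_state
        episode_reward += reward
    # Decay epsilon at the end of each episode, never below min_epsilon
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
    rewards_decay.append(episode_reward)
```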

12. Comparing strategies

Computing and plotting the average reward collected per episode for the epsilon-greedy and decayed epsilon-greedy strategies, we see that the latter performs much better, since the agent gradually shifts toward relying on its accumulated knowledge as the episodes progress.
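One way to produce such a comparison, assuming the reward lists collected above and averaging over fixed windows of episodes (the window size is an assumption):

```python
import matplotlib.pyplot as plt

window = 100  # episodes per averaging window
avg_eps = [np.mean(rewards_eps_greedy[i:i + window])
           for i in range(0, total_episodes, window)]
avg_decay = [np.mean(rewards_decay[i:i + window])
             for i in range(0, total_episodes, window)]

plt.plot(avg_eps, label="epsilon-greedy")
plt.plot(avg_decay, label="decayed epsilon-greedy")
plt.xlabel(f"Window of {window} episodes")
plt.ylabel("Average reward")
plt.legend()
plt.show()
```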

13. Let's practice!

Now, let's practice!