
Navigating the RL framework

1. Navigating the RL framework

Building on our RL foundations, let's uncover how its core components interact and influence the agent's strategic decisions.

2. RL framework

The RL framework consists of five key components: the agent,

3. RL framework

environment,

4. RL framework

states, actions, and rewards. The agent, acting as the learner or decision-maker, is like a player in a game. It interacts with the environment, which presents various challenges to be solved.

5. RL framework

Within this environment, a state represents a specific moment in time, much like a video game frame, capturing the current situation that the agent observes.

6. RL framework

The agent's actions are responses to these states,

7. RL framework

and rewards from the environment are feedback on these actions, either positive to encourage or negative to discourage certain behaviors.

8. RL interaction loop

Let's demonstrate the agent-environment interaction using a generic code example, setting the stage for advanced scenarios we'll explore using gymnasium environments. The process starts by creating an environment and retrieving the initial state. The agent then enters a loop where it selects an action based on the current state in each iteration. After executing the action, the environment provides feedback in the form of a new state and a reward. Finally, the agent updates its knowledge based on the state, action, and reward it received.
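Here is a minimal sketch of that loop in Python. The names create_environment, choose_action, and update_knowledge are hypothetical placeholders standing in for whatever environment and learning logic you plug in; the gymnasium environments we'll use later follow the same pattern.

# Minimal sketch of the agent-environment interaction loop.
# create_environment, choose_action, and update_knowledge are
# hypothetical placeholders, not part of any specific library.
env = create_environment()          # create the environment
state = env.get_initial_state()     # retrieve the initial state

for i in range(n_iterations):
    # The agent selects an action based on the current state
    action = choose_action(state)

    # The environment responds with a new state and a reward
    new_state, reward = env.execute(action)

    # The agent updates its knowledge using the feedback received
    update_knowledge(state, action, reward, new_state)

    state = new_state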

9. Episodic vs. continuous tasks

In RL, we encounter two types of tasks: episodic and continuous. Episodic tasks are divided into distinct episodes, each with a defined beginning and end. For example, in a chess game played by an agent, each game constitutes an episode. Once a game concludes, the environment resets for the next one. On the other hand, continuous tasks involve ongoing interaction without distinct episodes. A typical example is an agent continuously adjusting traffic lights in a city to optimize flow. In this course, we will primarily focus on episodic tasks, which are generally more common.

10. Return

In RL, actions carry long-term consequences, impacting both immediate and future rewards. The agent's goal goes beyond maximizing immediate gains; it strives to accumulate the highest total reward over time. This leads us to a key concept in RL: the return. The return is the sum of all rewards the agent expects to accumulate throughout its journey. Accordingly, the agent learns to anticipate the sequence of actions that will yield the highest possible return.

11. Discounted return

However, immediate rewards are typically valued more than future ones, leading to the concept of 'discounted return'. This concept weights nearer-term rewards more heavily by multiplying each reward by a discount factor, gamma, raised to the power of its time step. For example, for expected rewards r1 through rn, the discounted return would be calculated as r1 + gamma * r2 + gamma^2 * r3 + ... + gamma^(n-1) * rn.

12. Discount factor

The discount factor gamma, ranging between 0 and 1, is crucial for balancing immediate and long-term rewards. A lower gamma value leads the agent to prioritize immediate gains, while a higher value emphasizes long-term benefits. At the extremes, a gamma of zero means the agent focuses solely on immediate rewards, while a gamma of one considers future rewards as equally important, applying no discount.
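As a quick worked illustration with a hypothetical reward sequence of 5, 5, 5: a gamma of 0 gives a discounted return of 5 (only the immediate reward counts), a gamma of 1 gives 5 + 5 + 5 = 15 (a plain, undiscounted sum), and an in-between value such as 0.5 gives 5 + 0.5 * 5 + 0.25 * 5 = 8.75.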

13. Numerical example

In this example, we'll demonstrate how to calculate the discounted_return from an array of expected_rewards. We define a discount_factor of 0.9, then create an array of discounts, where each element corresponds to the discount factor raised to the power of the reward's position in the sequence. As we can see, discounts decrease over time, giving less importance to future rewards. Next, we multiply each reward by its corresponding discount and sum the results to compute the discounted_return, which is 8.83 in this example.
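A minimal NumPy sketch of that computation follows. The transcript doesn't show the expected_rewards array, so the values below are an assumption chosen only to reproduce the 8.83 result.

import numpy as np

expected_rewards = np.array([1, 6, 3])   # assumed values; not shown in the transcript
discount_factor = 0.9

# Discount for each reward: gamma raised to the power of its position
discounts = np.array([discount_factor ** t for t in range(len(expected_rewards))])
# discounts -> [1.0, 0.9, 0.81], decreasing over time

# Multiply each reward by its discount and sum the results
discounted_return = np.sum(expected_rewards * discounts)
print(discounted_return)   # 8.83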

14. Let's practice!

Now, let's practice!
