1. Markov Decision Processes
A critical aspect of solving RL problems is the ability to model the environment effectively. This is where Markov Decision Processes, or MDPs, come into play.
2. MDP
An MDP provides a mathematical framework for modeling RL environments.
It simplifies complex environments by defining four key components: states, actions, and rewards, which we are already familiar with, and transition probabilities, which represent the likelihood of moving from one state to another after taking an action. This is vital in unpredictable environments, where the same action can lead to different outcomes.
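To make these components concrete, here is a minimal sketch of how they could be written down in Python for a tiny made-up environment; the state names, actions, and probability values are purely illustrative and are not taken from any specific environment.

```python
# Illustrative MDP components for a tiny two-state environment
states = ["start", "goal"]     # set of states
actions = ["stay", "move"]     # set of actions

# Transition probabilities: transition_probs[state][action] -> {next_state: probability}
transition_probs = {
    "start": {
        "stay": {"start": 1.0},
        "move": {"goal": 0.8, "start": 0.2},  # an action may not have its intended effect
    },
    "goal": {"stay": {"goal": 1.0}, "move": {"goal": 1.0}},
}

# Rewards: rewards[state][action][next_state]
rewards = {
    "start": {"stay": {"start": 0.0}, "move": {"goal": 1.0, "start": 0.0}},
    "goal": {"stay": {"goal": 0.0}, "move": {"goal": 0.0}},
}
```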
3. MDP
Once these components are defined, we can proceed to solve the environment using model-based RL techniques.
4. Markov property
At the heart of MDPs is the Markov property, which states that
the future state depends only on the current state and action, not on previous events.
Just like a chess game, where the next position is determined by the current arrangement and the move made, not by the move history.
5. Frozen Lake as MDP
Let’s frame the Frozen Lake environment as an MDP.
Here, the lake is a grid where the agent must reach a goal without falling into holes.
6. Frozen Lake as MDP - states
The states represent the grid positions that the agent can occupy, where each cell corresponds to a distinct state.
Here, we can observe three exemplary states.
7. Frozen Lake as MDP - terminal states
Some states are called terminal states as they lead to episode termination.
This happens when the agent reaches the goal or falls into a hole.
8. Frozen Lake as MDP - actions
The actions are moves such as up, down, left, and right, each represented by a specific number.
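In Gymnasium's Frozen Lake, these numbers follow a fixed convention; the small sketch below reflects the standard encoding (left is 0, as used in the example later in this lesson).

```python
# Frozen Lake action encoding (Gymnasium convention)
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
```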
9. Frozen Lake as MDP - transitions
To model the environment realistically, we have to account for the fact that actions don't necessarily lead to their expected outcomes.
Notably, since the lake is frozen, if the agent moves to the right,
10. Frozen Lake as MDP - transitions
it might indeed go to the right,
11. Frozen Lake as MDP - transitions
but it might also slip and end up moving down,
12. Frozen Lake as MDP - transitions
or even stay in the current state.
13. Frozen Lake as MDP - transitions
For this reason, we need the transition probabilities that inform us of the likelihood of reaching a particular state given a current state and action.
14. Frozen Lake as MDP - rewards
Finally, a positive reward is only given when the agent reaches the goal state.
15. Gymnasium states and actions
The Gymnasium library provides access to these MDP components for its environments.
When creating the Frozen Lake environment, the 'is_slippery' argument controls action outcomes: set to True, it introduces stochastic movements, so the agent may slip perpendicular to its intended direction, for example moving up instead of left. Set to False, it makes movements deterministic.
env.action_space defines the range of actions an agent can take. Here, it is Discrete(4), meaning the agent has four possible actions, numbered from 0 to 3.
env.observation_space identifies the different state ranges the agent may encounter.
Here, it's Discrete(16), corresponding to the 16 cells on the 4x4 grid.
To get the number of actions and states in a discrete environment, we use env.action_space.n and env.observation_space.n, returning 4 and 16 respectively.
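As a rough sketch, the code below creates the environment and inspects these spaces; 'FrozenLake-v1' is the current Gymnasium id for this environment, and the printed values correspond to the default 4x4 map.

```python
import gymnasium as gym

# Create the 4x4 Frozen Lake; is_slippery=True makes transitions stochastic
env = gym.make("FrozenLake-v1", is_slippery=True)

print(env.action_space)         # Discrete(4)
print(env.observation_space)    # Discrete(16)
print(env.action_space.n)       # 4
print(env.observation_space.n)  # 16
```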
16. Gymnasium rewards and transitions
The transition probabilities and rewards can be accessed through
env.unwrapped.P, a dictionary with state-action pairs as keys.
For a given state and action,
the value is a list of tuples, each specifying the transition probability to the next_state, the next_state itself, the reward, and whether the next_state is terminal.
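As a sketch of that structure, the loop below walks over the dictionary and unpacks each tuple; it assumes the environment created above.

```python
# env.unwrapped.P[state][action] is a list of
# (probability, next_state, reward, terminated) tuples
for state, transitions_by_action in env.unwrapped.P.items():
    for action, transitions in transitions_by_action.items():
        for probability, next_state, reward, terminated in transitions:
            ...  # use these values to build a model of the environment
```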
17. Gymnasium rewards and transitions - example
For instance, consider an agent in state 6 deciding to move left with action 0.
env.unwrapped.P[6][0] returns a list of tuples, each representing a potential outcome. In this example, the agent has a 33% chance of ending up in each of the cells 2, 5, or 10 after executing the left action. It won’t receive any reward, and reaching state 5 will result in the termination of the game.
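For illustration, the query below reproduces this example on the environment created earlier; the exact ordering of the outcomes in the list may differ.

```python
# Inspect the possible outcomes of taking action 0 (left) in state 6
for probability, next_state, reward, terminated in env.unwrapped.P[6][0]:
    print(probability, next_state, reward, terminated)

# Expected output (order may vary):
# 0.333... 2  0.0 False
# 0.333... 5  0.0 True   <- state 5 is a hole, so the episode terminates
# 0.333... 10 0.0 False
```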
18. Let's practice!
Time for some practice!