Reinforcement Learning
Reinforcement Learning (RL) is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior. Reinforcement learning also differs from unsupervised learning in its goal: while the goal in unsupervised learning is to find similarities and differences between data points, the goal in reinforcement learning is to find a suitable action model that maximizes the total cumulative reward of the agent.
An RL problem is best explained through games. Take the game of PacMan, where the goal of the agent (PacMan) is to eat the food in the grid while avoiding the ghosts on its way. In this case, the grid world is the interactive environment in which the agent acts. The agent receives a reward for eating food and a punishment if it gets killed by a ghost (loses the game). The states are the locations of the agent in the grid world, and the total cumulative reward is the agent winning the game.
There are three broad approaches to implementing reinforcement learning:
Value-Based – The main goal of this method is to maximize a value function. The agent, acting through a policy, expects a long-term return from the current states.
Policy-Based – In policy-based methods, you come up with a strategy that helps gain the maximum reward in the future through the possible actions performed in each state. Policy-based methods can be deterministic or stochastic.
Model-Based – In this method, we create a virtual model of the environment, and the agent learns to perform within that specific environment.
Q-learning is a commonly used model-free approach that can be used to build a self-playing PacMan agent. It revolves around the notion of updating Q-values, where Q(s, a) denotes the value of performing action a in state s. The following value update rule is the core of the Q-learning algorithm.
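In its standard form, the update is

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α is the learning rate and γ is the discount factor.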
Q-learning and SARSA (State-Action-Reward-State-Action) are two commonly used model-free RL algorithms. They differ in their exploration strategies, while their exploitation strategies are similar. Q-learning is an off-policy method in which the agent learns the value based on an action a* derived from another policy (such as the greedy policy), whereas SARSA is an on-policy method that learns the value based on its current action a, derived from its current policy. These two methods are simple to implement but lack generality, as they do not have the ability to estimate values for unseen states.
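The difference shows up directly in the two update rules. The sketch below is illustrative rather than a reference implementation: the names q, alpha, gamma, and a_next are assumptions for the example, and q is assumed to be a NumPy array indexed by (state, action).

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap from the greedy (max-valued) action in the next
    # state, regardless of which action the behavior policy actually takes.
    target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action a_next actually chosen by the
    # current policy in the next state.
    target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (target - q[s, a])
```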
Positive –
Positive reinforcement occurs when an event, occurring as a result of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement:
Maximizes performance
Sustains change for a long period of time
A drawback is that too much reinforcement can lead to an overload of states, which can diminish the results.
Negative –
Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
Increases behavior
Helps enforce a minimum standard of performance
A drawback is that it only provides enough to meet the minimum behavior.
Q-Learning
We build an agent that interacts with the environment through a trial-and-error process. At each time step t, the agent is in a certain state s_t and chooses an action a_t to perform. The environment runs the selected action and returns a reward to the agent. The higher the reward, the better the action. The environment also tells the agent whether the episode is done or not. So an episode can be represented as a sequence of state-action-reward transitions.
In the Q-Learning algorithm, the goal is to iteratively learn the optimal Q-value function using the Bellman optimality equation. To do so, we store all the Q-values in a table that we update at each time step using the Q-Learning iteration:
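The loop below is a minimal sketch of this procedure, not a reference implementation: it assumes a hypothetical Gym-style environment exposing reset() and step(action) that returns (next_state, reward, done), and the hyperparameters alpha, gamma, and epsilon are illustrative.

```python
import numpy as np

def train_q_table(env, n_states, n_actions, episodes=500,
                  alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning against an environment with reset()/step()."""
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: explore with probability epsilon.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            # Q-Learning iteration: move Q(s, a) toward the Bellman target.
            target = r + gamma * np.max(q[s_next]) * (not done)
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q
```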
Markov Decision Process (MDP)
A Markov decision process (MDP) is a stochastic decision-making process that uses a mathematical framework to model the decision-making of a dynamic system. It is used in scenarios where the outcomes are either random or controlled by a decision maker who makes sequential decisions over time. MDPs evaluate which actions the decision maker should take given the current state and environment of the system.
The MDP model relies on the Markov property, which states that the future can be determined only from the present state, because the present state encapsulates all the necessary information from the past. The Markov property can be expressed with the following equation:
P[St+1 | St] = P[St+1 | S1, S2, S3, …, St]
According to this equation, the probability of the next state St+1 given only the present state St is equal to the probability of St+1 given the entire history of states S1, S2, S3, …, St. This implies that an MDP uses only the present/current state to evaluate the next actions, without any dependency on previous states or actions.
A Markov process is defined by (S, P), where S is the set of states and P is the state-transition probability. It consists of a sequence of random states S₁, S₂, … in which every state obeys the Markov property. The state-transition probability P_ss' is the probability of jumping to a state s' from the current state s.
A Markov reward process (MRP) is defined by (S, P, R, γ), where S is the set of states, P is the state-transition probability, R_s is the reward, and γ is the discount factor.
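Using this notation, the state-transition probability is

P_ss' = P[St+1 = s' | St = s]

and the Return G_t (the quantity the discount factor acts on) is the total discounted reward collected from time step t onward:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + …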
The variable γ ∈ [0, 1] is the discount factor. The intuition behind using a discount is that there is no certainty about future rewards. While it is important to consider future rewards to increase the Return, it is equally important to limit their contribution to the Return (since you can't be 100 percent certain of the future).
The policy (Π) determines the agent's optimal action given the current state, so that it gains the maximum reward. In simple words, it maps states to actions:
Π: S → A
To determine the best policy, it is essential to define the Return, which captures the agent's rewards at every state. A fixed-horizon approach is not preferred, as it forces a focus on either short-term or long-term rewards. Instead, the discount factor (γ) is used. The rule is that if γ is closer to zero, immediate rewards are prioritized; if γ is closer to one, the focus shifts to long-term rewards. For example, with γ = 0.9, a reward received k steps in the future is weighted by 0.9^k, so it still counts, but less than an immediate reward. Hence, the discounted infinite-horizon method is key to finding the best policy.
The state-value function v(s) is the expected Return starting from state s. The value function can be divided into two components: the immediate reward of the current state and the discounted value of the next state. This decomposition yields Bellman's equation, as shown below:
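In standard notation, the decomposition reads

v(s) = E[ R_{t+1} + γ·v(St+1) | St = s ]

where the expectation is taken over the possible next states reached from s.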
Here, it is worth noting that the agent’s actions and rewards vary based on the policy.
This implies that the value function is specific to a policy.
Consider a problem where we need to decide whether a tribe should go deer hunting in a nearby forest to ensure long-term returns. Each deer generates a fixed return. However, if the tribe hunts beyond a limit, it can result in a lower yield next year. Hence, we need to determine the optimum portion of deer that can be caught while maximizing the return over a longer period.
The problem statement can be simplified in this case: whether or not to hunt a certain portion of deer. In the context of an MDP, the problem can be expressed as follows:
States: The number of deer available in the forest in the year under consideration.
The four states include empty, low, medium, and high, which are defined as follows:
• Empty: No deer available to hunt
• Low: Available deer count is below a threshold t_1
• Medium: Available deer count is between t_1 and t_2
• High: Available deer count is above a threshold t_2
Actions: Actions include go_hunt and no_hunting, where go_hunt implies catching
certain proportions of deer. It is important to note that for the empty state, the only
possible action is no_hunting.
Rewards: Hunting in each state generates rewards. The rewards for hunting in the low, medium, and high states may be $5K, $50K, and $100K, respectively. Moreover, if the action results in an empty state, the reward is -$200K, because of the re-breeding of new deer that is then required, which involves time and money.
State transitions: Hunting in a state causes a transition to a state with fewer deer. Conversely, the action no_hunting causes a transition to a state with more deer, except in the 'high' state.
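As an illustration only, the sketch below encodes this deer-hunting MDP and solves it with value iteration. The transition probabilities, the recovery dynamics under no_hunting, and the way the -$200K penalty is combined with the hunting reward are assumptions made for the example; the reward figures follow the text above (values are in thousands of dollars).

```python
# States and actions of the (hypothetical) deer-hunting MDP.
STATES = ["empty", "low", "medium", "high"]

# transitions[state][action] = list of (probability, next_state, reward).
transitions = {
    "empty": {
        "no_hunting": [(1.0, "low", 0.0)],            # population slowly recovers
    },
    "low": {
        "no_hunting": [(1.0, "medium", 0.0)],
        "go_hunt":    [(0.5, "low", 5.0), (0.5, "empty", 5.0 - 200.0)],
    },
    "medium": {
        "no_hunting": [(1.0, "high", 0.0)],
        "go_hunt":    [(0.8, "low", 50.0), (0.2, "empty", 50.0 - 200.0)],
    },
    "high": {
        "no_hunting": [(1.0, "high", 0.0)],           # already at carrying capacity
        "go_hunt":    [(1.0, "medium", 100.0)],
    },
}

def value_iteration(gamma=0.9, tol=1e-6):
    """Compute state values and a greedy policy for the MDP above."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values()
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    policy = {
        s: max(transitions[s], key=lambda a: sum(
            p * (r + gamma * v[s2]) for p, s2, r in transitions[s][a]))
        for s in STATES
    }
    return v, policy

if __name__ == "__main__":
    values, policy = value_iteration()
    print(values)   # expected discounted return from each state
    print(policy)   # e.g. whether go_hunt is preferred in 'medium' and 'high'
```

Running value iteration on such a model gives the discounted value of each population level and a policy that balances the hunting rewards against the -$200K penalty for emptying the forest.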