Reinforcement Learning

Reinforcement Learning (RL) is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.

Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and
punishments as signals for positive and negative behavior. Reinforcement learning also differs from unsupervised learning in its goal: while the
goal in unsupervised learning is to find similarities and differences between data
points, the goal in reinforcement learning is to find a suitable action model
that maximizes the total cumulative reward of the agent.

Some key terms that describe the basic elements of an RL problem are:

1. Environment — Physical world in which the agent operates
2. State — Current situation of the agent
3. Reward — Feedback from the environment
4. Policy — Method to map the agent's state to actions
5. Value — Future reward that an agent would receive by taking an action in a particular state
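
The way these elements fit together can be illustrated with a short Python sketch of the basic agent-environment interaction loop. The environment object and the random policy below are hypothetical placeholders, not part of any particular library:

import random

def run_episode(env, policy):
    # The environment provides the initial state.
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        # The policy maps the current state to an action.
        action = policy(state)
        # The environment returns the next state, a reward signal, and a done flag.
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

# A trivial policy: pick a random action regardless of the state.
def random_policy(state):
    return random.choice(["up", "down", "left", "right"])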

An RL problem can be best explained through games. Let's take the game of PacMan, where the goal of the agent (PacMan) is to eat the food in the grid while avoiding the ghosts on its way. In this case, the grid world is the interactive environment in which the agent acts. The agent receives a reward for eating food and a punishment if it gets killed by a ghost (loses the game). The states are the locations of the agent in the grid world, and maximizing the total cumulative reward corresponds to the agent winning the game.

There are three broad approaches to implementing an RL algorithm:

Value-Based – The main goal of this method is to maximize a value function. Here, the agent, acting through a policy, expects a long-term return from the current states.
Policy-Based – In policy-based methods, you come up with a strategy (policy) that helps gain maximum reward in the future through the possible actions performed in each state. Two types of policy-based methods are deterministic and stochastic.
Model-Based – In this method, we create a virtual model of the environment, and the agent learns to perform within that specific environment.

Markov Decision Processes (MDPs) are mathematical frameworks for describing an environment in RL, and almost all RL problems can be formulated using MDPs. An MDP consists of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s), and a transition model P(s' | s, a). However, real-world environments are more likely to lack any prior knowledge of the environment dynamics; model-free RL methods come in handy in such cases. An RL problem can thus be described by the following set of parameters: a set of finite states S, a set of possible actions in each state A, a reward R, a model T, and a policy π. The outcome of applying an action in a state does not depend on previous actions or states, but only on the current action and state.

Q-learning is a commonly used model-free approach which can be used for building a self-playing PacMan agent. It revolves around the notion of updating Q-values, where Q(s, a) denotes the value of performing action a in state s. The following value update rule is the core of the Q-learning algorithm.
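In its standard form, the update is:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

where α is the learning rate, γ is the discount factor, r is the reward received, and s' is the state reached after taking action a in state s.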
Q-learning and SARSA (State-Action-Reward-State-Action) are two commonly used model-free RL algorithms. They differ in their exploration strategies, while their exploitation strategies are similar. Q-learning is an off-policy method in which the agent learns the value based on the greedy action a* derived from another policy, whereas SARSA is an on-policy method that learns the value based on its current action a derived from its current policy. These two methods are simple to implement but lack generality, as they have no ability to estimate values for unseen states.
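
The difference between the two update rules can be sketched in a few lines of Python. The Q-table here is a plain dictionary keyed by (state, action) pairs, and the values of alpha and gamma are illustrative, not prescribed by the text:

# Q-learning (off-policy): bootstrap from the greedy action in the next state.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# SARSA (on-policy): bootstrap from the action actually taken in the next state.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])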

Types of Reinforcement: There are two types of reinforcement:

Positive –
Positive reinforcement occurs when an event, occurring as a result of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages: it maximizes performance and sustains change for a long period of time.
Disadvantage: too much reinforcement can lead to an overload of states, which can diminish the results.

Negative –
Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages: it increases behavior and helps enforce a minimum standard of performance.
Disadvantage: it only provides enough to meet the minimum behavior.
Q-Learning
We build an agent that interacts with the environment through a trial-and-error process. At each time step t, the agent is in a certain state s_t and chooses an action a_t to perform. The environment runs the selected action and returns a reward to the agent. The higher the reward, the better the action. The environment also tells the agent whether it is done or not. So an episode can be represented as a sequence of state-action-reward triples.

In the Q-Learning algorithm, the goal is to iteratively learn the optimal Q-value function using the Bellman Optimality Equation. To do so, we store all the Q-values in a table and update them at each time step using the Q-Learning iteration given above.
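
A minimal sketch of such a table-based training loop is shown below. It assumes a hypothetical environment with reset() and step() methods, a finite list of actions, and illustrative hyperparameter values; it is not tied to any specific library:

import random
from collections import defaultdict

def train_q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q-table: maps (state, action) pairs to estimated values, 0.0 by default.
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning iteration: move Q(s, a) toward the Bellman optimality target.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q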
Markov Decision Process (MDP)
A Markov decision process (MDP) refers to a stochastic decision-making process that uses a mathematical framework to model the decision-making of a dynamic system. It is used in scenarios where the outcomes are either random or controlled by a decision maker who makes sequential decisions over time. MDPs evaluate which actions the decision maker should take, considering the current state and environment of the system.

The MDP model uses the Markov Property, which states that the future can be determined only from the present state, because that state encapsulates all the necessary information from the past. The Markov Property can be expressed by the following equation:

P[St+1 | St] = P[St+1 | S1, S2, S3, …, St]

According to this equation, the probability of the next state given only the present state St is equal to the probability of the next state given all the previous states (S1, S2, S3, …, St). This implies that an MDP uses only the present/current state to evaluate the next actions, without any dependence on previous states or actions.
A Markov process is defined by (S, P), where S is the set of states and P is the state-transition probability. It consists of a sequence of random states S1, S2, … in which all the states obey the Markov Property. The state-transition probability P_ss' is the probability of jumping to a state s' from the current state s.
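In symbols, P_ss' = P[St+1 = s' | St = s].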

The Markov reward process (MRP) is defined by (S, P, R, γ), where S is the set of states, P is the state-transition probability, R_s is the reward, and γ is the discount factor.
The variable γ ∈ [0, 1] is the discount factor. The intuition behind using a discount is that there is no certainty about future rewards. While it is important to consider future rewards to increase the Return, it is equally important to limit their contribution to the Return, since you cannot be 100 percent certain of the future.
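Concretely, the Return Gt is commonly defined as the total discounted reward from time step t onward:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + …

so a reward received k steps in the future is weighted by γ^(k−1).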

The policy and value function

The policy (π) determines the agent's action in the current state so that it gains the maximum reward. In simple words, it associates actions with states:
π: S → A
To determine the best policy, it is essential to define the Return, which captures the agent's rewards at every state. Simply fixing a horizon that focuses only on short-term or only on long-term rewards is not preferred. Instead, a variable termed the 'discount factor (γ)' is introduced. If γ takes values closer to zero, the immediate rewards are prioritized; if γ takes values closer to one, the focus shifts to long-term rewards. Hence, the discounted infinite-horizon method is key to revealing the best policy.

The state value function v(s) is the expected Return starting from state s.
The value function can be divided into two components: the reward of the current state and the discounted value of the next state. This decomposition gives Bellman's equation, as shown below:
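For a policy π, the standard form of this equation is:

vπ(s) = E[ Rt+1 + γ vπ(St+1) | St = s ]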

Here, it is worth noting that the agent’s actions and rewards vary based on the policy.
This implies that the value function is specific to a policy.

Consider a problem where we need to decide whether a tribe should go deer hunting in a nearby forest to ensure long-term returns. Each deer generates a fixed return. However, if the tribe hunts beyond a limit, it can result in a lower yield the next year. Hence, we need to determine the optimum proportion of deer that can be caught while maximizing the return over a longer period.
The problem statement can be simplified in this case: whether to hunt a certain proportion of deer or not. In the context of an MDP, the problem can be expressed as follows:
States: The number of deer available in the forest in the year under consideration.
The four states are empty, low, medium, and high, defined as follows:
• Empty: No deer available to hunt
• Low: Available deer count is below a threshold t_1
• Medium: Available deer count is between t_1 and t_2
• High: Available deer count is above a threshold t_2
Actions: Actions include go_hunt and no_hunting, where go_hunt implies catching
certain proportions of deer. It is important to note that for the empty state, the only
possible action is no_hunting.
Rewards: Hunting in each state generates rewards of some kind. The rewards for hunting in the different states, such as low, medium, and high, may be $5K, $50K, and $100K, respectively. Moreover, if the action results in an empty state, the reward is -$200K. This is due to the required re-breeding of new deer, which involves time and money.
State transitions: Hunting in a state causes the transition to a state with fewer deer.
Subsequently, the action of no_hunting causes the transition to a state with more
deer, except for the ‘high’ state.
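
To make the formulation concrete, below is a rough Python sketch of this deer-hunting MDP. The states, actions, and rewards follow the description above, while the transition probabilities are purely hypothetical placeholders (the text does not specify them); a simple value-iteration loop then estimates the value of each state for this toy model:

# States and actions as described above.
STATES = ["empty", "low", "medium", "high"]
ACTIONS = {
    "empty": ["no_hunting"],
    "low": ["go_hunt", "no_hunting"],
    "medium": ["go_hunt", "no_hunting"],
    "high": ["go_hunt", "no_hunting"],
}

# Rewards for go_hunt in each state (in $K); no_hunting yields 0.
REWARDS = {"empty": 0, "low": 5, "medium": 50, "high": 100}
EMPTY_PENALTY = -200  # hunting the population down to the empty state

# Hypothetical transition probabilities P(s' | s, a); chosen only for illustration.
TRANSITIONS = {
    ("low", "go_hunt"): {"empty": 0.6, "low": 0.4},
    ("medium", "go_hunt"): {"low": 0.7, "medium": 0.3},
    ("high", "go_hunt"): {"medium": 0.8, "high": 0.2},
    ("low", "no_hunting"): {"low": 0.3, "medium": 0.7},
    ("medium", "no_hunting"): {"medium": 0.4, "high": 0.6},
    ("high", "no_hunting"): {"high": 1.0},
    ("empty", "no_hunting"): {"empty": 0.5, "low": 0.5},
}

def reward(s, a, s_next):
    # go_hunt earns the state's reward, plus a penalty if it empties the forest.
    r = REWARDS[s] if a == "go_hunt" else 0
    if a == "go_hunt" and s_next == "empty":
        r += EMPTY_PENALTY
    return r

def value_iteration(gamma=0.9, iterations=100):
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V = {
            s: max(
                sum(p * (reward(s, a, s2) + gamma * V[s2])
                    for s2, p in TRANSITIONS[(s, a)].items())
                for a in ACTIONS[s]
            )
            for s in STATES
        }
    return V

print(value_iteration())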
