21ai020 & Reinforcement Learning UNIT 1-LM:1
UNIT 1-LM:1
TOPIC: THE REINFORCEMENT LEARNING PROBLEM
INTRODUCTION:
o Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
o The agent learns automatically from this feedback, without any labeled data, unlike supervised learning.
o Example: Suppose an AI agent is present within a maze environment, and his goal is to find the diamond. The agent interacts with the environment by performing some actions, and based on those actions, the state of the agent gets changed, and it also receives a reward or penalty as feedback. The agent keeps doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing these actions, he learns and explores the environment (a short code sketch of this loop follows this list).
o The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
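The loop just described (take an action, the state changes, feedback comes back) can be written down directly. The sketch below is only an illustration: the GridMaze environment, its reward values, and the random action choice are assumptions invented for this example, not part of any particular library.

    import random

    class GridMaze:
        # Toy maze used above: the agent starts in cell 0 and the diamond is in cell 4.
        def __init__(self):
            self.state = 0

        def step(self, action):
            # action is -1 (move left) or +1 (move right)
            self.state = max(0, min(4, self.state + action))
            if self.state == 4:
                return self.state, +1.0, True    # positive feedback: diamond found
            return self.state, -0.1, False       # small penalty for every extra move

    env = GridMaze()
    state, done = env.state, False
    while not done:
        action = random.choice([-1, +1])          # no policy yet: act at random
        state, reward, done = env.step(action)    # environment returns new state and feedback
        print("state =", state, "reward =", reward)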
Terms used in Reinforcement Learning:
o Environment(): The situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
o Action(): Actions are the moves taken by an agent within the environment.
o Policy(): Policy is a strategy applied by the agent to decide the next action based on the current state (illustrated in the sketch below).
Key features of Reinforcement Learning:
o In RL, the agent is not instructed about the environment or which actions need to be taken.
o The agent takes the next action and changes states according to the feedback of the previous action.
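A policy, in the sense defined above, can be as simple as a table from states to actions. The five states and the "always move right" rule below are illustrative assumptions only, chosen to match the toy maze sketched earlier.

    STATES = [0, 1, 2, 3, 4]              # possible situations the environment can return
    policy = {s: +1 for s in STATES}      # strategy: in every state, move right

    def next_action(state):
        # The agent consults its policy to pick the next action for the current state.
        return policy[state]

    print(next_action(2))                 # -> 1, i.e. move right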
Approaches to implementing Reinforcement Learning: There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach is about finding the optimal value function, which is the maximum value achievable at a state under any policy. Therefore, the agent expects the long-term return at any state under that policy (see the code sketch after this list).
2. Policy-based:
The policy-based approach is about finding the optimal policy for the maximum future reward without using a value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward. The policy can be deterministic, where the same action is produced by the policy (π) at any state, or stochastic, where probability determines the produced action.
3. Model-based:
In this approach, a virtual model is created for the environment, and the agent explores that environment to learn it. There is no particular solution or algorithm for this approach because the model representation is different for each environment.
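To make the contrast between the first two approaches concrete, the sketch below works on the same hypothetical 5-cell maze (diamond in cell 4). The transition and reward model, the discount factor of 0.9, and the number of sweeps are all assumptions for illustration, not a prescribed algorithm from these notes.

    GAMMA = 0.9
    STATES = range(5)
    ACTIONS = (-1, +1)

    def next_state(s, a):
        # Clamp movement to the corridor of cells 0..4.
        return max(0, min(4, s + a))

    def reward(s, a):
        # +1 for reaching the diamond, a small penalty for any other move.
        return 1.0 if next_state(s, a) == 4 else -0.1

    # Value-based idea: iterate toward the optimal value function V*(s), the
    # maximum value attainable at each state under any policy, then act greedily.
    V = {s: 0.0 for s in STATES}
    for _ in range(50):                   # repeated sweeps until the values settle
        for s in STATES:
            if s == 4:                    # terminal state: diamond already found
                continue
            V[s] = max(reward(s, a) + GAMMA * V[next_state(s, a)] for a in ACTIONS)

    greedy_policy = {s: max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * V[next_state(s, a)])
                     for s in STATES if s != 4}

    # Policy-based idea: represent the policy directly and improve it for future
    # reward, without consulting a value function when choosing actions.
    direct_policy = {s: +1 for s in STATES}   # e.g. the fixed rule "always move right"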
2. Negative Reinforcement:
Negative reinforcement strengthens behaviour by increasing the tendency that the specific behaviour will occur again by avoiding the negative condition. It can be more effective than positive reinforcement, depending on the situation and behaviour, but it provides reinforcement only to meet a minimum level of behaviour.
Examples:
o A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
To make these ideas concrete, consider the familiar child's game of tic-tac-toe. Two players take turns playing on a three-by-three board. One player plays Xs and the other Os until one player wins by placing three marks in a row, horizontally, vertically, or diagonally, as the X player has in this game:
X O O
O X X
    X
Figure 1.1: A sequence of tic-tac-toe moves. The solid lines
represent the moves taken during a game; the dashed lines
represent moves that we (our reinforcement learning player)
considered but did not make. Our second move was an
exploratory move, meaning that it was taken even though another
sibling move, the one leading to e∗, was ranked higher.
Exploratory moves do not result in any learning, but each of our
other moves does, causing backups as suggested by the curved
arrows and detailed in the text.
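The backups mentioned in the caption can be written as an update that moves the value of the earlier state a fraction of the way toward the value of the later state, V(s) ← V(s) + α[V(s′) − V(s)]. The sketch below is a minimal illustration of one such backup; the 9-character board encoding, the neutral initial value of 0.5, and the step size α = 0.1 are assumptions made here for illustration.

    ALPHA = 0.1                           # step-size: how far each backup moves the value

    values = {}                           # estimated chance of winning from each board state

    def value(board):
        # Look up a state's value; unseen states start at a neutral 0.5.
        return values.setdefault(board, 0.5)

    def backup(earlier, later):
        # Move the earlier state's value a fraction ALPHA toward the later state's value.
        values[earlier] = value(earlier) + ALPHA * (value(later) - value(earlier))

    # Example: a greedy (non-exploratory) move led from `before` to `after`,
    # and `after` turned out to be a win for X, so its value is 1.
    before = "XOO" "OXX" "   "            # 3x3 board flattened to a 9-character string
    after  = "XOO" "OXX" "  X"
    values[after] = 1.0
    backup(before, after)                 # exploratory moves would simply skip this call
    print(values[before])                 # about 0.55: moved 10% of the way from 0.5 toward 1.0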