Final Project

Pin-Shiang Chu
Department of Civil Engineering
National Central University
Taoyuan, Taiwan
[email protected]
I. Introduction

This final project aims to use reinforcement learning algorithms to enable drones to autonomously learn how to navigate from a starting point to an endpoint while avoiding obstacles. To evaluate the effectiveness of the algorithm, the training process will be simulated in a virtual environment. In the future, the learning results can be applied in real-world settings, allowing drones to complete autonomous flights in the real world based on their training in the simulator.

II. Background

This project is designed to operate within the Gazebo environment, aiming to develop an algorithm that enables an unmanned aerial vehicle (UAV) to learn to fly from (6, 3) to (6, -6). Due to the time-consuming nature of reweighting during the learning process, the implementation will initially simulate the reweighting process using MATLAB. This approach ensures the correctness and effectiveness of the algorithm before applying it to the Gazebo environment.
III. Reinforcement Learning

In this environment, it is highly suitable to use reinforcement learning to train robots. Reinforcement learning is a type of machine learning in which a computer learns to perform a task correctly through repeated interactions with a dynamic environment. In the real world, it is not feasible to let robots make constant trial-and-error choices, as this would be very costly. Therefore, by training robots in a simulator, they can learn which behaviors to avoid and then achieve the desired objectives in the real world. This trial-and-error learning method enables computers to make a series of decisions without human intervention and without being explicitly programmed to perform specific tasks.

IV. Q-Learning

Q-Learning is a reinforcement learning algorithm based on value iteration. Its fundamental concept is to guide an agent in choosing the optimal action by learning the value function Q(s,a) of state-action pairs. The Q-value represents the expected cumulative reward obtained after taking action a in state s. The algorithm approximates the optimal value function by continuously updating the Q-values. The Q-value is updated with the following formula:

Q(s,a) = Q(s,a) + α[r + γ max_a′ Q(s′,a′) − Q(s,a)]
A. Q(s,a): The current Q-value for taking action a in state s, which we want to update.
B. α: Learning rate, with a range between 0 and 1. It determines the weight given to new information. A higher value makes learning faster but more unstable, while a lower value makes learning slower but more stable.
C. r: Immediate reward received after taking action a in state s.
D. γ: Discount factor, with a range between 0 and 1. It determines the present value of future rewards. A higher value means future rewards have a greater impact on current decisions.
E. max_a′ Q(s′,a′): The maximum Q-value over all possible actions a′ in the next state s′. It represents the highest expected reward for the next state.
F. r + γ max_a′ Q(s′,a′) − Q(s,a): This term is called the Temporal Difference (TD) error. It represents the difference between the current estimate and the newly observed information, i.e., the gap between the current Q-value and the updated Q-value.
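To make this update concrete, the following minimal Python sketch applies the formula above to a Q-table. It is illustrative only; the table size, state indexing, and parameter values are assumptions, not part of the project.

import numpy as np

# Q-table for a discretized problem; the sizes here are assumptions for illustration.
n_states, n_actions = 100, 5
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: r + gamma * max_a' Q(s', a') - Q(s, a)
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    # Incremental update of the current Q-value.
    Q[s, a] += alpha * td_error
    return td_error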
V. Algorithm

For each episode:
1. a = max(Q(s, a)) (ε-greedy)
2. Get the reward r for (s, a)
3. Q(s,a) = Q(s,a) + α[r + γ max_a′ Q(s′,a′) − Q(s,a)]

A. Choose a: With a probability of x, a direction is chosen randomly, and with a probability of 1 − x, the direction with the highest Q-value is chosen.
B. Calculate the TD error: δ = r + γ max_a′ Q(s′,a′) − Q(s,a). The TD error represents the difference between the current Q-value and the estimated Q-value after the update.
C. Update the Q-value: Q(s,a) ← Q(s,a) + αδ. The learning rate α controls the impact of the TD error, adjusting the Q-value incrementally.
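As a rough illustration of this episode loop (a sketch, not the project's Gazebo/ROS code), one ε-greedy training episode could look like the following; the env object with reset() and step() is an assumed interface.

import numpy as np

def run_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    # One training episode; `env` (reset/step) and the Q-table layout are assumptions.
    s = env.reset()
    done = False
    while not done:
        # Choose a: random direction with probability epsilon,
        # otherwise the direction with the highest Q-value.
        if np.random.rand() < epsilon:
            a = np.random.randint(Q.shape[1])
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # TD error and incremental Q-value update.
        delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * delta
        s = s_next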
VI. Proposed Method

The above method involves recording the Q-value for each state-action pair to build a Q-table. However, when the state or action space is very large or continuous, this approach becomes impractical. In such cases, we can use function approximation to replace the Q-table. This method is known as function-based Q-Learning, and we can define our own Q-function for this purpose.

A. Function-Based Q-Learning

Q(s,a;x) ← Q(s,a;x) + α[r + γ max_a′ Q(s′,a′;x) − Q(s,a;x)]

The Q function can therefore be expressed as the inner product [f1, f2, f3, f4, f5] · [x1, x2, x3, x4, x5]^T of a feature vector and a parameter vector x.

B. Gradient Descent for Parameter Updates

To update the parameters x, we use gradient descent to minimize a loss function, typically defined as the squared TD error:

L(x) = [r + γ max_a′ Q(s′,a′;x) − Q(s,a;x)]^2

The update rule for x is:

x ← x + α[r + γ max_a′ Q(s′,a′;x) − Q(s,a;x)] ∇x Q(s,a;x)

For the linear Q function above, the gradient ∇x Q(s,a;x) is simply the feature vector [f1, f2, f3, f4, f5].
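The following short Python sketch shows how this update works under the linear-Q assumption above; the function and variable names are illustrative and not taken from the project code.

import numpy as np

def q_value(x, f):
    # Linear Q-function: Q(s, a; x) = [f1..f5] . [x1..x5]
    return float(np.dot(x, f))

def update_weights(x, f, r, q_next_max, alpha=0.01, gamma=0.9):
    # Semi-gradient update: for a linear Q, the gradient with respect to x
    # is the feature vector f, so the TD error is simply scaled by f.
    td_error = r + gamma * q_next_max - q_value(x, f)
    return np.asarray(x) + alpha * td_error * np.asarray(f)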
C. Algorithm

For each episode:
1. a = max(Q(s, a))
2. Get the reward r for (s, a)
3. W := W + α * [r + γ * max(Q(s′, a′)) − Q(s, a)] * X, where W is the parameter vector and X is the feature vector.

D. Parameter Design
i. The feature design of the Q function selects [f1, f2, f3, f4, f5] to represent the following (a computation sketch follows this list):
1. Constant
2. abs(goal.x - robot.x)
3. abs(goal.y - robot.y)
4. abs(atan2(goal.y - robot.y, goal.x - robot.x) - robot_t), representing the difference between the drone's heading and the angle to the target.
5. sqrt(pow(robot.x - goal.x, 2) + pow(robot.y - goal.y, 2)), representing the absolute distance to the target.
ii. The reward is designed such that encountering an obstacle or a wall results in -10, reaching the endpoint gives a reward of 10, and all other situations give a reward of -0.05.
iii. The actions are designed to include five directions:
Left 30 degrees
Left 15 degrees
Forward (straight ahead)
Right 15 degrees
Right 30 degrees
iv. The states:
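For reference, the five features in D.i could be computed as in the sketch below. The variable names (robot_x, goal_x, robot_t, and so on) are hypothetical placeholders for the robot pose and goal position; this is not the project's code.

import math

def compute_features(robot_x, robot_y, robot_t, goal_x, goal_y):
    # Features [f1..f5] from D.i, computed from the robot pose and the goal position.
    f1 = 1.0                                                             # constant term
    f2 = abs(goal_x - robot_x)                                           # x-offset to the goal
    f3 = abs(goal_y - robot_y)                                           # y-offset to the goal
    f4 = abs(math.atan2(goal_y - robot_y, goal_x - robot_x) - robot_t)   # heading error
    f5 = math.sqrt((robot_x - goal_x) ** 2 + (robot_y - goal_y) ** 2)    # straight-line distance
    return [f1, f2, f3, f4, f5]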

VII. Conclusion

During the process of working on the final project, several critical issues were encountered. First, during training in Gazebo, it is crucial to be careful when detecting obstacles through ROS. The UAV may not maintain a perfectly horizontal attitude during flight, which can inadvertently cause the floor to be detected as an obstacle and lead to training errors. Another issue arose during the weight-updating process. Initially, when the UAV collided with an obstacle, the intention was to reset the robot's position to (6, 3, 1) using a function. However, this method caused the UAV to fall due to the lack of initial velocity. Consequently, the weights had to be saved to a file after each round and then loaded from the file for the next round of training.