Pin-Shiang Chu
Department of Civil Engineering
National Central University
Taoyuan, Taiwan
[email protected]

I. Introduction
This final project aims to use reinforcement learning algorithms to enable drones to autonomously learn how to navigate from a starting point to an endpoint while successfully avoiding obstacles. To evaluate the effectiveness of the algorithm, the training process will be simulated in a virtual environment. In the future, the learning results can be applied in real-world settings, allowing drones to complete autonomous flights in the real world based on their training in the simulator.

II. Background

This project is designed to operate within the Gazebo environment, aiming to develop an algorithm that enables an unmanned aerial vehicle (UAV) to learn to fly from (6, 3) to (6, -6). Because reweighting during the learning process is time-consuming, the implementation will initially simulate the reweighting process using MATLAB. This approach ensures the correctness and effectiveness of the algorithm before applying it to the Gazebo environment.

III. Reinforcement Learning

In this environment, reinforcement learning is highly suitable for training robots. Reinforcement learning is a type of machine learning in which a computer learns to perform a task correctly through repeated interactions with a dynamic environment. In the real world, it is not feasible to let robots make constant trial-and-error choices, as this would be very costly. By training robots in a simulator, they can learn which behaviors to avoid and then achieve the desired objectives in the real world. This trial-and-error learning method enables computers to make a series of decisions without human intervention and without being explicitly programmed for specific tasks.

IV. Q-Learning

Q-Learning is a reinforcement learning algorithm based on value iteration. Its fundamental concept is to guide an agent in choosing the optimal action by learning the value function Q(s, a) of state-action pairs. The Q-value represents the expected cumulative reward obtained after taking action a in state s. The algorithm approximates the optimal value function by continuously updating the Q-values. The Q-value is updated with the following formula (a short code sketch of this update follows the definitions below):

Q(s, a) = Q(s, a) + α[r + γ max Q(s′, a′) − Q(s, a)]

A. Q(s, a): The current Q-value for taking action a in state s, which we want to update.
B. α: Learning rate, with a range between 0 and 1. It determines the weight given to new information. A higher value makes learning faster but more unstable, while a lower value makes learning slower but more stable.
C. r: Immediate reward received after taking action a in state s.
D. γ: Discount factor, with a range between 0 and 1. It determines the present value of future rewards. A higher value means future rewards have a greater impact on current decisions.
E. max Q(s′, a′): The maximum Q-value over all possible actions a′ in the next state s′. It represents the highest expected reward attainable from the next state.
F. r + γ max Q(s′, a′) − Q(s, a): This term is called the temporal-difference (TD) error. It represents the difference between the current estimate and the newly observed information, i.e., the gap between the current Q-value and the updated Q-value.
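To make the update rule concrete, the following is a minimal Python sketch of a single tabular Q-update. The dictionary-based Q-table, the state encoding, and the parameter values are assumptions made for illustration, not the project's actual code.

    from collections import defaultdict

    ALPHA = 0.1          # learning rate (assumed value)
    GAMMA = 0.9          # discount factor (assumed value)
    ACTIONS = range(5)   # e.g., the five heading changes described in Section VI

    # Q-table: maps (state, action) pairs to Q-values, defaulting to 0.0
    Q = defaultdict(float)

    def q_update(s, a, r, s_next):
        """One step of Q(s,a) = Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
        best_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)
        td_error = r + GAMMA * best_next - Q[(s, a)]
        Q[(s, a)] += ALPHA * td_error
        return td_error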
V. Algorithm

For each episode:
1. Select a = argmax Q(s, a) (ε-greedy)
2. Take action a and obtain the reward r
3. Q(s, a) = Q(s, a) + α[r + γ max Q(s′, a′) − Q(s, a)]

A. Choose a: With a probability of x, a direction is chosen randomly, and with a probability of 1 − x, the direction with the highest Q-value is chosen.
B. Calculate the TD error: r + γ max Q(s′, a′) − Q(s, a). The TD error represents the difference between the current Q-value and the estimated Q-value after the update.
C. Update the Q-value: Q(s, a) ← Q(s, a) + αδ. The learning rate α controls the impact of the TD error, adjusting the Q-value incrementally. A sketch of the complete loop is given below.
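The steps above can be assembled into a small training loop. The sketch below reuses the Q-table and q_update function from the previous sketch; the environment object env with reset() and step() methods, the exploration probability, and the episode count are hypothetical stand-ins, not the project's Gazebo interface.

    import random

    EPSILON = 0.1        # exploration probability x (assumed value)
    NUM_EPISODES = 500   # assumed

    def choose_action(s):
        """epsilon-greedy: random action with probability EPSILON, otherwise greedy."""
        if random.random() < EPSILON:
            return random.choice(list(ACTIONS))
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    for episode in range(NUM_EPISODES):
        s = env.reset()                     # hypothetical environment API
        done = False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)   # hypothetical environment API
            q_update(s, a, r, s_next)       # TD update from the previous sketch
            s = s_next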
VI. Proposed Method

The method above records a Q-value for every state-action pair to build a Q-table. However, when the state or action space is very large or continuous, this approach becomes impractical. In such cases, we can use function approximation to replace the Q-table. This method is known as function-based Q-Learning, and we can define our own Q-function for this purpose.

A. Function-Based Q-Learning
Q(s, a; x) ← Q(s, a; x) + α[r + γ max Q(s′, a′; x) − Q(s, a; x)]

B. Gradient Descent for Parameter Updates
To update the parameters x, we use gradient descent to minimize a loss function, typically defined as the squared TD error:
L(x) = [r + γ max Q(s′, a′; x) − Q(s, a; x)]^2
The resulting update rule for x is:
x ← x + α[r + γ max Q(s′, a′; x) − Q(s, a; x)] ∇x Q(s, a; x)

C. Algorithm
For each episode:
1. Select a = argmax Q(s, a) (ε-greedy)
2. Take action a and obtain the reward r
3. W := W + α[r + γ max Q(s′, a′) − Q(s, a)] · X

D. Parameter Design
i. The feature design of the Q-function selects [f1, f2, f3, f4, f5] to represent:
   1. A constant.
   2. abs(goal.x - robot.x)
   3. abs(goal.y - robot.y)
   4. abs(atan2(goal.y - robot.y, goal.x - robot.x) - robot_t), the difference between the drone's heading and the angle to the target.
   5. sqrt(pow(robot.x - goal.x, 2) + pow(robot.y - goal.y, 2)), the absolute distance to the target.
   The Q-function can therefore be expressed as the inner product [f1, f2, f3, f4, f5] · [x1, x2, x3, x4, x5] (a sketch of this design follows after this section).
ii. The reward is designed such that encountering an obstacle or a wall results in -10, reaching the endpoint gives a reward of 10, and all other situations give a reward of -0.05.
iii. The actions are designed to include five directions: left 30 degrees, left 15 degrees, forward (straight ahead), right 15 degrees, and right 30 degrees.
iv. The states:
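As an illustration of how the feature design and the weight update fit together, the following Python sketch computes the five features and applies W := W + α[r + γ max Q(s′, a′) − Q(s, a)] · X for a linear Q-function. The simple turn-then-move model used to evaluate Q(s, a), the reward helper, and all constants are assumptions made for this sketch; it is not the project's actual Gazebo/ROS implementation.

    import math

    ALPHA = 0.01                                  # learning rate (assumed value)
    GAMMA = 0.9                                   # discount factor (assumed value)
    GOAL = (6.0, -6.0)                            # endpoint from the problem description
    TURNS_DEG = [-30.0, -15.0, 0.0, 15.0, 30.0]   # the five directions in iii
    STEP = 0.5                                    # assumed forward step per decision

    def features(x, y, theta):
        """Feature vector [f1..f5] for a pose, following the design in i."""
        dx, dy = GOAL[0] - x, GOAL[1] - y
        return [
            1.0,                              # f1: constant
            abs(dx),                          # f2: |goal.x - robot.x|
            abs(dy),                          # f3: |goal.y - robot.y|
            abs(math.atan2(dy, dx) - theta),  # f4: heading error to the target
            math.hypot(dx, dy),               # f5: distance to the target
        ]

    def predict_pose(x, y, theta, turn_deg):
        """Assumed motion model: turn by turn_deg, then move STEP forward."""
        t = theta + math.radians(turn_deg)
        return (x + STEP * math.cos(t), y + STEP * math.sin(t), t)

    def reward(hit_obstacle, reached_goal):
        """Reward design in ii: -10 on collision, +10 at the goal, -0.05 otherwise."""
        if hit_obstacle:
            return -10.0
        if reached_goal:
            return 10.0
        return -0.05

    def q_value(w, pose, turn_deg):
        """Linear Q(s, a; x): inner product of the weights and the predicted-pose features."""
        f = features(*predict_pose(*pose, turn_deg))
        return sum(wi * fi for wi, fi in zip(w, f))

    def update_weights(w, pose, turn_deg, r, next_pose):
        """W := W + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)] * X."""
        x_vec = features(*predict_pose(*pose, turn_deg))   # X = gradient of the linear Q
        best_next = max(q_value(w, next_pose, t) for t in TURNS_DEG)
        td_error = r + GAMMA * best_next - q_value(w, pose, turn_deg)
        return [wi + ALPHA * td_error * xi for wi, xi in zip(w, x_vec)]

In this sketch a five-element weight list, e.g. w = [0.0] * 5, would be passed through update_weights once per decision step.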
During implementation, several issues were encountered. First, when training in Gazebo, care must be taken when detecting obstacles through ROS. The UAV may not maintain a perfectly horizontal attitude during flight, which can cause the floor to be mistaken for an obstacle and lead to training errors. Another issue arose during the weight-updating process. Initially, when the UAV collided with an obstacle, the intention was to reset the robot's position to (6, 3, 1) using a function. However, it was discovered that this method caused the UAV to fall due to the lack of initial velocity. Consequently, the weights needed to be saved to a file after each round and then read back from the file for the next round of training.
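The weight-persistence workaround described above can be handled in a few lines. This is a minimal sketch assuming the weights are kept as a plain Python list and written to a hypothetical weights.txt file, not the project's actual file format.

    def save_weights(w, path="weights.txt"):
        """Write one weight per line after each training round."""
        with open(path, "w") as f:
            f.write("\n".join(str(wi) for wi in w))

    def load_weights(path="weights.txt", n=5):
        """Read the weights back for the next round; start from zeros if no file exists yet."""
        try:
            with open(path) as f:
                return [float(line) for line in f if line.strip()]
        except FileNotFoundError:
            return [0.0] * n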