state space and the reward function. An illustration of our formulation is shown in Figure 2.
B. Reinforcement Learning

Reinforcement learning is an umbrella term for a large number of algorithms derived for solving Markov Decision Processes (MDPs) [21].
In our framework, the objective of reinforcement learning is to train a driving agent that can execute ‘good’ actions, so that the new state and the possible state transitions up to a finite expectation horizon yield a high cumulative reward. The overall goal is quite straightforward for driving: avoiding collisions and reaching the destination should yield a good reward, and vice versa. It must be noted that RL frameworks are not greedy unless γ = 0. In other words, when an action is chosen, not only the immediate reward but the cumulative rewards of all the expected future state transitions are considered.
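As a concrete illustration (not part of the paper), the discounted cumulative reward described above can be sketched in a few lines of Python; setting γ = 0 recovers the greedy case mentioned in the text:

```python
def discounted_return(rewards, gamma):
    """Cumulative reward over a finite horizon, discounted by gamma.

    With gamma = 0 only the immediate reward counts (the greedy case);
    as gamma approaches 1, expected future transitions weigh almost as
    much as the immediate one.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Illustrative rewards for one hypothetical rollout: the immediate reward
# is small, but later transitions (e.g. reaching the destination) pay off.
rewards = [0.1, 0.2, 5.0]
print(discounted_return(rewards, gamma=0.0))  # 0.1    -> greedy choice
print(discounted_return(rewards, gamma=0.9))  # ≈ 4.33 -> future matters
```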
Here we employ DQN [25] to solve the MDP described above. The main idea of DQN is to use neural networks to approximate the optimal action-value function Q(s, a). This Q function maps the state-action space to the real line, Q : S × A → R, while maximizing equation 1. The problem comes down to approximating, or learning, this Q function. The following loss function is used for Q-learning at iteration i:

Li(θi) = E_(s,a,r) [ ( r + γ max_{at+1} Q^{θi−}(st+1, at+1) − Q^{θi}(st, at) )² ]   (2)

where Q-learning updates are applied on samples (s, a, r) ∼ U(D), and U(D) draws random samples from the data batch D. Here θi denotes the Q-network parameters and θi− the target network parameters at iteration i. Details of DQN can be found in [25].
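Neither the training code nor the exact update-rule implementation is given in this excerpt; purely as an illustration of equation (2), a PyTorch-style sketch of the DQN loss with a frozen target network might look as follows. The names q_net, target_net, and the batch fields are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Sketch of equation (2): squared TD error against a frozen target.

    `q_net` (parameters theta_i) and `target_net` (parameters theta_i^-)
    are assumed to map a batch of states to Q-values for every discrete
    action; `batch` holds tensors sampled uniformly from the replay data D.
    """
    s, a, r = batch["state"], batch["action"], batch["reward"]
    s_next, done = batch["next_state"], batch["done"]

    # Q^{theta_i}(s_t, a_t) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # r + gamma * max_{a_{t+1}} Q^{theta_i^-}(s_{t+1}, a_{t+1}),
    # with the target network kept out of the gradient computation
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        td_target = r + gamma * (1.0 - done) * q_next

    # Expectation over the sampled batch of the squared TD error
    return F.mse_loss(q_sa, td_target)
```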
C. Integrating path planning into model-free DRL frameworks

The main contribution of this work is the integration of path planning into DRL frameworks. We achieve this by modifying the state space with the addition of d. Also, the reward function is changed to include a new reward term rw, which rewards being close to the nearest waypoint obtained from the model-based path planner, i.e., a small d. Utilizing waypoints to evaluate a DRL framework was suggested in a very recent work [30], but their approach does not consider integrating the waypoint generator into the model. The proposed reward function is as follows:

r = βc rc + βv rv + βl rl + βw rw   (3)

where rc is the no-collision reward, rv is the not-driving-very-slowly reward, rl is the being-close-to-the-destination reward, and rw is the proposed being-close-to-the-nearest-waypoint reward. The distance to the nearest waypoint d is shown in Figure 2. The weights of these rewards, βc, βv, βl, βw, are parameters defining the relative importance of the rewards. These parameters are determined heuristically. In the special case of βc = βv = βl = 0, the integrated model should mimic the model-based planner.

Please note that any planner, from the naive A* to more complicated algorithms with complete obstacle avoidance capabilities, can be integrated into this framework as long as it provides a waypoint.
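The individual reward terms are only described qualitatively in this excerpt, so the sketch of equation (3) below uses assumed shapes for rc, rv, rl, and rw (a binary collision term and simple speed/distance shaping); only the weighted-sum structure and the waypoint-distance term d come from the text:

```python
def hybrid_reward(collided, speed, dist_to_goal, dist_to_waypoint,
                  betas=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative form of r = βc·rc + βv·rv + βl·rl + βw·rw."""
    beta_c, beta_v, beta_l, beta_w = betas

    r_c = 0.0 if collided else 1.0      # no-collision reward
    r_v = min(speed / 10.0, 1.0)        # not driving very slowly (assumed shaping)
    r_l = -dist_to_goal                 # being close to the destination (assumed shaping)
    r_w = -dist_to_waypoint             # being close to the nearest waypoint, i.e. small d

    return beta_c * r_c + beta_v * r_v + beta_l * r_l + beta_w * r_w
```

With βc = βv = βl = 0, every term except the waypoint one vanishes, so the agent is rewarded only for staying close to the model-based planner's path, which is the special case noted above.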
IV. EXPERIMENTS

As in all RL frameworks, the agent needs to interact with the environment and fail a lot to learn the desired policies. This makes training RL driving agents in the real world extremely challenging, as failed attempts cannot be tolerated. As such, we focused only on simulations in this study. Real-world adaptation is outside of the scope of this work.

The proposed method was implemented in Python based on an open-source RL framework [31], and CARLA [32] was used as the simulation environment. The commonly used

Fig. 4. The experimental process: I. A random origin-destination pair was selected. II. The A* algorithm was used to generate a path. III. The hybrid DRL agent starts to take action with the incoming state stream. IV. The end of the episode.
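To make the process in Fig. 4 concrete, a hypothetical episode loop is sketched below; env, planner, and agent are placeholder interfaces for illustration, not the actual CARLA [32] or RL-framework [31] APIs used in the study:

```python
def run_episode(env, planner, agent, max_steps=1000):
    """Hypothetical episode mirroring the four steps in Fig. 4."""
    # I. A random origin-destination pair is selected.
    state, origin, destination = env.reset()

    # II. A model-based planner (here A*) generates a path of waypoints.
    waypoints = planner.plan(origin, destination)

    total_reward = 0.0
    for _ in range(max_steps):
        # The distance d to the nearest waypoint augments the state.
        d = min(env.distance_to(state, w) for w in waypoints)

        # III. The hybrid DRL agent acts on the incoming state stream.
        action = agent.act((state, d))
        state, reward, done = env.step(action)
        total_reward += reward

        # IV. The episode ends on collision, arrival, or timeout.
        if done:
            break
    return total_reward
```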