Sensor
1 Abstract
According to [1], road traffic tasks can be divided into three categories in terms of the human driver's responsibility:
navigation, guidance, and control. In this work, we focus on the guidance level in high-risk scenarios,
which is responsible for producing the desired trajectory and/or speed. First, we apply
deep imitation learning to obtain driver agent models from data generated by predefined
control laws. Then, reinforcement learning is applied to find a policy for high-risk scenarios via
a switching control model that considers both efficiency and safety.
2 Introduction
Autonomous driving technology has grown rapidly in recent years. However, high-risk scenarios, where a
potential accident is likely to happen, are still not handled well, because the appropriate action depends strongly
on the behavior of other drivers. As a result, the agent may need to change its actions significantly to stay safe.
According to recent studies [2] [3], reinforcement learning (RL) and imitation learning (IL) are the two
dominant approaches for learning driving actions in autonomous driving. Reinforcement learning learns
driving policies that maximize a reward function, while imitation learning tries to learn the behavior
of an expert, i.e., behavior cloning. One shortcoming of RL is that it needs to fully explore the environment,
while IL requires a large amount of expert demonstration data. Moreover, neither approach is well suited to
rapid transitions between actions, since the learned action varies continuously. Our approach combines RL and
IL to allow the driving agent to switch actions accordingly and meet the safety requirements of near-accident
scenarios.
First, we obtain driver agent models using deep imitation learning. The input to our algorithm is a dataset
generated from the CARLO simulator containing a four-component observation (the ego vehicle's location, the ego
vehicle's velocity, the other vehicle's location with noise, and the other vehicle's velocity with noise), the control input
(steering and throttle), and a driving mode indicator. We then use Conditional Imitation Learning (CoIL) [7]
to output a predicted control input given the observation and the driving mode indicator.
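As an illustration, a minimal sketch of such a branched CoIL model is given below (PyTorch); the class name BranchedCoIL, the observation dimension, and the layer sizes are our own assumptions rather than details taken from [7].

```python
import torch
import torch.nn as nn

class BranchedCoIL(nn.Module):
    """Conditional imitation learning model: a shared encoder over the
    observation plus one control head (branch) per driving mode."""

    def __init__(self, obs_dim=6, n_modes=3, hidden=64):
        super().__init__()
        # Shared feature extractor over the observation
        # (ego location/velocity, other vehicle's location/velocity with noise).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One branch per driving-mode indicator; each branch outputs
        # the 2-D control input (steering, throttle).
        self.branches = nn.ModuleList(
            [nn.Linear(hidden, 2) for _ in range(n_modes)]
        )

    def forward(self, obs, mode):
        # obs: (batch, obs_dim) float tensor; mode: (batch,) integer tensor.
        feat = self.encoder(obs)
        out = torch.stack([branch(feat) for branch in self.branches], dim=1)
        # Pick, for each sample, the branch selected by its mode indicator.
        return out[torch.arange(obs.shape[0]), mode]
```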
The basic environment is CARLO [4], which performs 2D driving simulation. Two different high-risk
scenarios, cross traffic and wrong direction, are tested. The different driving modes are evaluated based
on completion time and collision rate.
3 Related work
Imitation learning (IL) is one of the popular methods. Muller et al. implemented behavior-cloning IL for off-road
obstacle avoidance [5]. The algorithm learns a driving policy from the state-action pairs in
the dataset. One drawback is that it generalizes poorly to unpredicted behaviors in new
test domains. It also requires a huge amount of expert demonstrations, leading to low data efficiency.
Codevilla et al. [6] proposed a method called Conditional Imitation Learning (CoIL), which extends IL with high-level
commands, as shown in Figure 1 [7]. It learns a separate IL model for each high-level command, with some
features shared between the learned IL models. This improves data efficiency, but it still requires high-level
commands at test time. Our approach proposes to solve this problem by using reinforcement learning to learn an
agent that provides the high-level commands, instead of relying on commands provided by drivers.
Reinforcement learning (RL) is another main approach applied in autonomous driving [2]. It explores the
environment and then takes, in each state, the action that maximizes a pre-defined reward. One shortcoming
is that the state space in driving scenarios is very large, which makes it hard to explore fully.
Hierarchical Reinforcement Learning [8] was proposed to address this problem. It consists of multiple layers:
the higher layer acts like a manager that sets goals for the lower layers, and the lower layer acts like a worker that achieves
those goals. This improves exploration efficiency. Finally, Nair et al. extended hierarchical RL to use expert
demonstrations to obtain the high-level commands that guide the exploration of RL [9]. However, none of these algorithms
addresses near-accident scenarios well, since it is difficult to design the low-level reward
function for hierarchical RL. Instead of using RL to learn the low-level policy, our approach first uses IL to
obtain the low-level policy. Then we use RL to obtain the high-level commands.
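To make this division of labor concrete, one possible formulation of the high-level problem is a gym-style environment whose discrete action picks a driving mode, while a pre-trained low-level IL policy for that mode produces the actual control. The class name, the simulator interface, and the reward weights below are illustrative assumptions rather than the implementation of [4].

```python
import gym
import numpy as np

class ModeSelectionEnv(gym.Env):
    """High-level environment: the RL agent picks a driving mode at each step;
    a pre-trained low-level IL policy for that mode produces the
    (steering, throttle) control applied in the simulator."""

    def __init__(self, simulator, il_policies):
        super().__init__()
        self.sim = simulator            # assumed CARLO scenario wrapper (hypothetical interface)
        self.il_policies = il_policies  # list of per-mode IL policies: observation -> control
        self.action_space = gym.spaces.Discrete(len(il_policies))
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32)

    def reset(self):
        self.obs = self.sim.reset()
        return self.obs

    def step(self, mode):
        control = self.il_policies[mode](self.obs)          # low-level action from the chosen IL mode
        self.obs, collided, reached, dt = self.sim.step(control)  # assumed simulator return values
        # Reward trades off efficiency (time penalty) against safety (collision penalty).
        reward = -dt - 100.0 * collided + 10.0 * reached
        done = collided or reached
        return self.obs, reward, done, {}
```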
6 Experiments/Results/Discussion
The preliminary experiments use the scenarios proposed in [4]. The experiments are conducted in
two different scenarios:
(1) In the first scenario, called the intersection scenario, the ego car approaches a crossroad while an ado car (simulated
by the computer using a pre-defined control law) is also approaching (Figure 2a). For this scenario, there are
three control modes governing the ego car: aggressive, normal, and timid. In the aggressive mode, the ego car
generally drives at a higher speed and is more likely to collide with the ado car, while in the timid mode collisions
are avoided. The normal mode is designed so that its completion time and collision rate both lie between those of
the aggressive and timid modes. All three modes are hard-coded control laws.
(2) The second scenario simulates the situation where the ado car drives in the opposite direction
towards the ego car, as shown in Figure 2b. For this scenario, there are only two modes: aggressive and timid.
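The hard-coded modes can be viewed as simple rule-based controllers. The sketch below is a hypothetical illustration of what such control laws might look like; the thresholds, gains, and the ego/ado attribute names are our own assumptions, not the values used in the experiments.

```python
import numpy as np

def aggressive_control(ego, ado, target_speed=8.0):
    """Hypothetical aggressive mode: hold a high target speed and ignore the ado car."""
    throttle = 1.0 if ego.speed < target_speed else 0.0
    return 0.0, throttle  # (steering, throttle); straight-line driving

def timid_control(ego, ado, target_speed=4.0, safe_gap=15.0):
    """Hypothetical timid mode: drive slowly and brake whenever the ado car is close."""
    gap = np.linalg.norm(np.asarray(ado.position) - np.asarray(ego.position))
    if gap < safe_gap:
        return 0.0, -1.0  # brake (negative throttle as deceleration, by assumption)
    throttle = 1.0 if ego.speed < target_speed else 0.0
    return 0.0, throttle
```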
The simulator used in this experiment is CARLO, a customized 2D driving simulator with a simple
dynamics model and visualizations. CARLO runs 2D simulations quickly and provides perception
and measurement data.
We assume a point-mass dynamics model in these scenarios and no other obstacles. Considering both
safety and efficiency, a test is defined as a success when the ego car reaches the target within a certain amount of
time without colliding with the ado car or the environment. The data we use consist of two parts: the
first part is provided by Erdem Bıyık and is associated with the paper [4]; the second part is generated by
hard-coded control laws in the gym environment.
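With this success definition, collision rate and completion time can be aggregated over test episodes roughly as follows; run_episode and the time limit are placeholder assumptions, not part of the original setup.

```python
def evaluate(policy, env, n_episodes=100, time_limit=20.0):
    """Roll out a policy and report collision rate and mean completion time.
    An episode counts as a success only if the ego car reaches the target
    within the time limit without any collision."""
    collisions, times = 0, []
    for _ in range(n_episodes):
        reached, collided, t = run_episode(policy, env, time_limit)  # assumed helper
        collisions += int(collided)
        if reached and not collided:
            times.append(t)
    collision_rate = collisions / n_episodes
    mean_time = sum(times) / len(times) if times else float("inf")
    return collision_rate, mean_time
```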
For both scenarios, the observations are the ego car's location and velocity and the ado car's location and velocity. The
only difference between scenarios 1 and 2 is that in scenario 1 the ego car's location is one-dimensional (because the
ego car only drives in a straight line), while in scenario 2 it is two-dimensional. The observation,
along with the high-level command (timid or aggressive), is fed into a neural network to obtain a policy that
minimizes a loss function defined by the difference between the ego car's and the expert's behavior.
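Concretely, the training step is standard behavior cloning: minimize the mean-squared error between the predicted and expert controls for the same observation and command. The sketch below assumes the BranchedCoIL model from the earlier sketch and uses random placeholder tensors in place of the real CARLO demonstration data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for the demonstrations:
# observations, mode indicators, and expert (steering, throttle) controls.
obs = torch.randn(1000, 6)
mode = torch.randint(0, 2, (1000,))
expert_control = torch.randn(1000, 2)
loader = DataLoader(TensorDataset(obs, mode, expert_control), batch_size=64, shuffle=True)

model = BranchedCoIL(obs_dim=6, n_modes=2)   # branched model from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    for o, m, u in loader:
        pred = model(o, m)        # branch selected by the high-level command
        loss = loss_fn(pred, u)   # imitation loss against the expert control
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```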
The results of scenario 1 are shown in Figure 2a. So far, for conditional imitation learning, we have
three results: "CoIL-aggressive", "CoIL-middle" and "CoIL-timid". "Aggressive", "Timid" and "Normal"
are the test results of the hard-coded control policies before CoIL and RL. "Random" shows the result
of a policy that combines the three control modes at random. Finally, "RL-2 mode" and "RL-3 mode" are the results
of using reinforcement learning when there are two control modes to select from (aggressive and timid) and when
there are three control modes to select from (aggressive, middle and timid).
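RL-2 mode and RL-3 mode differ only in how many IL modes the high-level agent may select from. As an illustration only, a policy-gradient method such as PPO [11] could be trained on the mode-selection environment sketched earlier along these lines; stable-baselines3 and the object names below are our own assumptions, not necessarily what was used.

```python
from stable_baselines3 import PPO

# Pass two IL policies (aggressive, timid) for "RL-2 mode";
# pass three (aggressive, middle, timid) for "RL-3 mode".
env = ModeSelectionEnv(simulator, [coil_aggressive, coil_timid])  # assumed objects from earlier sketches
agent = PPO("MlpPolicy", env, verbose=0)
agent.learn(total_timesteps=200_000)
```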
First, we compare the results of imitation learning and the hard-coded policies. In terms of collision rate,
there is little difference between normal and CoIL-normal or between timid and CoIL-timid. The collision rate
for aggressive is 0.9, while for CoIL-aggressive it is 0.24. For completion time, the trend is
reversed. This is partly because if the ego car behaves aggressively, it drives at
a comparatively higher speed, which leads to a shorter completion time and a higher collision rate. It is not surprising
that the timid mode has the longest completion time, while the aggressive mode has the shortest.
A random policy is also included for later comparison with the reinforcement learning results. Its
collision rate and completion time fall between those of the three hard-coded policies, which makes
sense because the random policy chooses among the three modes with equal probability, and the chosen mode directly
determines the ego car's action, which is the throttle in this case.
The results of scenario 2 are shown in Figure 4. For this scenario, we only train the RL policy with two driving
modes. The overall performance of the RL policy is similar to what we saw in scenario 1. The completion
time of RL-2 mode is about the same as that of the random policy, but the collision rate of RL-2 mode is much
lower (0.11 versus 0.24 for the random policy).
7 Conclusion/Future Work
In summary, we conclude that the proposed approach, which first uses conditional imitation learning to learn
driving models from an expert and then trains a high-level policy with reinforcement learning, performs well in
both scenarios. Although the two scenarios in this project are rather simple and the simulation relies on several
assumptions, the results shed light on the application of CoIL-RL to more complicated scenarios where
the motion planning of the vehicle is challenging due to the environment. However, some work
remains to be done. First, more high-risk scenarios, including a halting car, merging, and unprotected turns, can be
used to evaluate the performance of different driving models and switching techniques. Second, the simulation
can be done in CARLA [12], where the physical model is more realistic. Finally, the expert data can be
obtained from real drivers instead of hard-coded policies.
References
[3] Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Proceedings of
the 1st International Conference on Neural Information Processing Systems, NIPS'88, pages 305–313,
Cambridge, MA, USA, 1988. MIT Press.
[4] Zhangjie Cao, Erdem Bıyık, Woodrow Z. Wang, Allan Raventos, Adrien Gaidon, Guy Rosman, and
Dorsa Sadigh. Reinforcement learning based control of imitative policies for near-accident driving. In
Proceedings of Robotics: Science and Systems (RSS), July 2020.
[5] Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, and Yann L. Cun. Off-road obstacle avoidance through
end-to-end learning. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information
Processing Systems 18, pages 739–746. MIT Press, 2006.
[6] Felipe Codevilla, Matthias Müller, Alexey Dosovitskiy, Antonio López, and Vladlen Koltun. End-to-end
driving via conditional imitation learning. CoRR, abs/1710.02410, 2017.
[7] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-
end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and
Automation (ICRA), May 2018.
[8] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep
reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In D. D. Lee,
M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information
Processing Systems 29, pages 3675–3683. Curran Associates, Inc., 2016.
[9] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming
exploration in reinforcement learning with demonstrations. CoRR, abs/1709.10089, 2017.
[10] Qingwen Xue, Ke Wang, Jian Lu, and Yujie Liu. Rapid driving style recognition in car-following using
machine learning and vehicle trajectory data. Journal of Advanced Transportation, 2019:1–11, 01 2019.
[11] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. CoRR, abs/1707.06347, 2017.
[12] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An
open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages
1–16, 2017.