
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 73, NO. 2, FEBRUARY 2024

A Deep Reinforcement Learning Framework for Eco-Driving in Connected and Automated Hybrid Electric Vehicles
Zhaoxuan Zhu, Member, IEEE, Shobhit Gupta, Member, IEEE, Abhishek Gupta, Member, IEEE, and Marcello Canova, Member, IEEE

Abstract—Connected and Automated Vehicles (CAVs), in particular those with multiple power sources, have the potential to significantly reduce fuel consumption and travel time in real-world driving conditions. In particular, the eco-driving problem seeks to design optimal speed and power usage profiles based upon look-ahead information from connectivity and advanced mapping features, to minimize the fuel consumption over a given itinerary. In this work, the eco-driving problem is formulated as a Partially Observable Markov Decision Process (POMDP), which is then solved with a state-of-the-art Deep Reinforcement Learning (DRL) Actor Critic algorithm, Proximal Policy Optimization. An eco-driving simulation environment is developed for training and evaluation purposes. To benchmark the performance of the DRL controller, a baseline controller representing the human driver, a trajectory optimization algorithm and the wait-and-see deterministic optimal solution are presented. With a minimal onboard computational requirement and a comparable travel time, the DRL controller reduces the fuel consumption by more than 17% compared against the baseline controller by modulating the vehicle velocity over the route and performing energy-efficient approach and departure at signalized intersections, over-performing the more computationally demanding trajectory optimization method.

Index Terms—Connected and automated vehicle, eco-driving, deep reinforcement learning, dynamic programming, long short-term memory.

Manuscript received 31 December 2022; revised 29 August 2023; accepted 10 September 2023. Date of publication 22 September 2023; date of current version 13 February 2024. This work was supported in part by the United States Department of Energy, Advanced Research Projects Agency–Energy (ARPA-E) NEXTCAR Project under Grant DE-AR0000794 and in part by The Ohio Supercomputer Center. The review of this article was coordinated by Dr. Cailian Chen. (Corresponding author: Zhaoxuan Zhu.)
Zhaoxuan Zhu was with the Center for Automotive Research, The Ohio State University, Columbus, OH 43212 USA. He is now with Motional, Boston, MA 02210 USA (e-mail: [email protected]).
Shobhit Gupta was with the Center for Automotive Research, The Ohio State University, Columbus, OH 43212 USA. He is now with General Motors Research and Development, Warren, MI 48092 USA (e-mail: [email protected]).
Abhishek Gupta is with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]).
Marcello Canova is with the Center for Automotive Research, The Ohio State University, Columbus, OH 43212 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TVT.2023.3318552

I. INTRODUCTION

With the advancement in the vehicular connectivity and autonomy, Connected and Automated Vehicles (CAVs) have the potential to operate in a more time- and fuel-efficient manner [1]. With Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication, the controller has access to real-time look-ahead information including the terrain, infrastructure and surrounding vehicles. Intuitively, with connectivity technologies, controllers can plan a speed profile that allows the ego vehicle to intelligently pass more signalized intersections in green phases with less change in speed. This problem is formulated as the eco-driving problem (incorporating Eco-Approach and Departure at signalized intersections), which aims to minimize the fuel consumption and the travel time between two designated locations by co-optimizing the speed trajectory and the powertrain control strategy [2], [3].

The literature related to the eco-driving problem distinguishes among two aspects, namely, powertrain configurations and traffic scenarios. Regarding powertrain configuration, the difference is in whether the powertrain is equipped with a single power source [3], [4], [5], [6] or a hybrid electric architecture [7], [8], [9], [10]. The latter involves modeling multiple power sources and devising optimal control algorithms that can synergistically split the power demand to efficiently utilize the electric energy stored in the battery. Maamria et al. [11] systematically compare the computational requirement and the optimality of different eco-driving formulations solved offline via Deterministic Dynamic Programming (DDP).

Related to the traffic scenarios, Ozatay et al. [4] proposed a framework providing an advisory speed profile using online optimization conducted on a cloud-based server without considering the real-time traffic light variability. Olin et al. [9] implemented the eco-driving framework to evaluate real-world fuel economy benefits obtained from a control logic computed in a Rapid Prototyping System on-board a test vehicle. As traffic lights are not explicitly considered in these studies, the eco-driving control module is required to be coupled with other decision-making agents, such as human drivers or Adaptive Cruise Control (ACC) systems. Other studies have explicitly modeled and considered Signal Phase and Timings (SPaTs). Jin et al. [3] formulated the problem as a Mixed Integer Linear Programming (MILP) for conventional vehicles with Internal Combustion Engine (ICE). Asadi et al. [12] used traffic simulation models and proposed to solve the problem considering probabilistic SPaT with DDP. Sun et al. [6] formulated the eco-driving problem as a distributionally robust stochastic optimization problem with collected real-world data.


Guo et al. [8] proposed a bi-level control framework with a hybrid vehicle. Bae [10] extended the work in [6] to a heuristic HEV supervisory controller. Deshpande et al. [13], [14] designed a hierarchical Model Predictive Control (MPC) strategy to combine a heuristic logic with intersection passing capability with the strategy in [9]. Guo [15], [16] developed hierarchical planning and control modules for the fuel efficiency in vehicle platooning, and later on extended the work to vehicle fleets [17], [18].

The dimensionality of the problem, therefore the computational requirements, can become quickly intractable as the number of states increases. This is the case, for instance, when the energy management of a hybrid powertrain system is combined with velocity optimization in presence of other vehicles or approaching signalized intersections. The aforementioned methods either consider a simplified powertrain model [10], [19] or treat the speed planning and the powertrain control hierarchically [8]. Although such efforts made the real-time implementation feasible, the optimality can be sacrificed [11].

The use of Deep Reinforcement Learning (DRL) in the context of eco-driving has caught considerable attention in recent years. DRL provides a train-offline, execute-online methodology with which the policy is learned from historical data or by interacting with a simulated environment. The offline training can either result in an explicit policy or be implemented as part of MPC as the terminal cost function. Shi et al. [20] modeled the conventional vehicles with ICE as a simplified model and implemented Q-learning to minimize the CO2 emission at signalized intersections. Li et al. [21] apply an actor-critic algorithm on the ecological ACC problem in car-following mode. Pozzi et al. [22] designed a velocity planner with Deep Deterministic Policy Gradient (DDPG) that operates on top of an ACC module for safety concerns and considers the signalized intersection and hybrid powertrain configuration.

This work focuses on the development of the eco-driving controller for HEVs with the capability to pass signalized intersections autonomously under urban and highway conditions. The contribution of this work is threefold.
a) Compared to the previous applications of DRL on the eco-driving problem [20], [21], the eco-driving problem is formulated as a centralized problem where a physics-based quasi-static nonlinear hybrid electric powertrain model is considered.
b) To overcome the intensive onboard computation, a novel Partially Observable Markov Decision Process (POMDP) eco-driving formulation is proposed and subsequently solved with an actor-critic DRL algorithm, Proximal Policy Optimization (PPO), along with Long Short-Term Memory (LSTM) as the function approximators. In addition, the design of the reward mechanism, particularly regarding the behaviors at signalized intersections, is discussed in detail.
c) A co-simulation framework that integrates the powertrain model and the traffic simulation is proposed such that routes directly sampled from a large-scale city map can be used for training and evaluation.

Compared to the aforementioned studies, this article demonstrates that, by using a modern DRL algorithm along with an efficient and high-fidelity environment model, a statistically superior policy with reasonable onboard computation can be learned for the task of eco-driving for HEVs under real-life driving routes, which is very challenging to tackle onboard by classical approaches.

To benchmark the performance of the resultant explicit policy, we present a baseline strategy representing human driving behaviors with a rule-based energy management module, a trajectory optimization strategy in [13], [14] and a wait-and-see deterministic optimal solution. The comparison was conducted over 100 randomly generated trips with Urban and Mixed Urban driving scenarios.

The remainder of the article is organized as follows. Section II presents the simulation environment. Section III introduces the preliminaries of the DRL algorithm employed in this work. Section IV mathematically formulates the eco-driving problem, and Section V presents the proposed DRL controller. Section VI presents the strategies used for benchmarking. Section VII shows the training details and benchmarks the performance.

II. ENVIRONMENT MODEL

Fig. 1. Structure of The Environment Model.

The successful training of the reinforcement learning agent relies on an environment to provide the data. In particular, model-free reinforcement learning methods typically require a large amount of data before agents learn the policy via the interaction with the environment. In the context of eco-driving, collecting such an amount of real-world driving data is expensive. Furthermore, the need for policy exploration during training poses safety concerns for human operators and hardware. For these reasons, a model of the environment is developed for training and validation purposes. The environment model, demonstrated in Fig. 1, consists of a Vehicle Dynamics and Powertrain (VD&PT) model and a microscopic traffic simulator. The environment model is discretized with a time difference of 1 s. The controller commands three control inputs, namely, the ICE torque, the electric motor torque and the mechanical brake torque.


The component-level torques collectively determine the HEV powertrain dynamics, the longitudinal dynamics of the ego vehicle and its location along the trip. While the states of vehicle and powertrain such as battery State-of-Charge (SoC), velocity and gear are readily available to the powertrain controller, the availability of the connectivity information depends on the infrastructure and the types of sensors equipped onboard. In this study, it is assumed that Dedicated Short Range Communication (DSRC) [12] sensors are available onboard, and SPaT becomes available and remains accurate once the vehicle enters the 200 m range. The uncertainties caused by sensor unavailability and inaccuracy in SPaT, as studied in [6], [23], are not considered in the simulation model or in the study. While adding uncertainties in the traffic model is left as future work, such uncertainties can be ingested by the Markov Decision Process (MDP) formulation. Thus, the model-free DRL problem formulation is expected to remain the same. The DRL agent utilizes the SPaT from the upcoming traffic light while ignoring the SPaT from any other traffic light regardless of the availability. Specifically, the distance to the upcoming traffic light, its status and SPaT program are fed into the controller as observations. Finally, a navigation application with Global Positioning System (GPS) is assumed to be on the vehicle such that the locations of the origin and the destination, the remaining distance, and the speed limits of the entire trip are available at every point during the trip.

Fig. 2. Block Diagram of 48 V P0 Mild-Hybrid Drivetrain.

A. Vehicle and Powertrain Model

A forward-looking dynamic powertrain model is developed for fuel economy evaluation and control strategy verification over real-world routes. In this work, a P0 mild-hybrid electric vehicle (mHEV) is considered, equipped with a 48 V Belted Starter Generator (BSG) performing torque assist, regenerative braking and start-stop functions. The diagram of the powertrain is illustrated in Fig. 2. The key components of the low-frequency quasi-static model are described below.

1) Engine Model: The engine is modeled as low-frequency quasi-static nonlinear maps. The fuel consumption and the torque limit maps are based on steady-state engine test bench data provided by a supplier:

ṁfuel,t = ψ(ωeng,t, Teng,t), (1)

where the second subscript t represents the discrete time index, and ωeng and Teng are the engine angular velocity and torque, respectively.

2) BSG Model: In a P0 configuration, the BSG is connected to the engine via a belt, as shown in (2). A simplified, quasi-static efficiency map η(ωbsg,t, Tbsg,t) is used to compute the electrical power output Pbsg,t in both regenerative braking and traction operating modes:

ωbsg,t = τbelt ωeng,t, (2)

Pbsg,t = Tbsg,t ωbsg,t η(ωbsg,t, Tbsg,t) if Tbsg,t < 0, and Pbsg,t = Tbsg,t ωbsg,t / η(ωbsg,t, Tbsg,t) if Tbsg,t > 0, (3)

where τbelt, ωbsg,t and Tbsg,t refer to the belt ratio, the BSG angular velocity and the BSG torque, respectively.

3) Battery Model: A zero-th order equivalent circuit model is used to model the current (It) dynamics. Coulomb counting [24] is used to compute the battery SoC:

It = [VOC(SoCt) − √(VOC²(SoCt) − 4 R0(SoCt) Pbsg,t)] / (2 R0(SoCt)), (4)

SoCt+1 = SoCt − (Δt / Cnom)(It + Ia), (5)

where Δt is the time discretization, which is set to 1 s in this study. The power consumed by the auxiliaries is modeled by a calibrated constant current bias Ia. The cell open circuit voltage VOC(SoCt) and internal resistance R0(SoCt) data are obtained from the pack supplier.
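As a concrete illustration of the Coulomb counting update in (4)-(5), the following minimal Python sketch steps the battery model forward by one time step. The open-circuit voltage and resistance curves, the nominal capacity and the auxiliary current below are placeholder assumptions, not the calibrated supplier data used in the paper.

```python
import numpy as np

# Hypothetical pack-level open-circuit voltage and resistance curves, used only
# to make the sketch runnable; the calibrated supplier data are not public.
SOC_GRID = np.linspace(0.1, 0.9, 9)
VOC_GRID = np.linspace(44.0, 54.0, 9)   # V, assumed 48 V pack characteristic
R0_GRID = np.full(9, 0.05)              # Ohm, assumed constant resistance

def battery_step(soc, p_bsg, dt=1.0, c_nom=28.0 * 3600.0, i_aux=2.0):
    """One Coulomb-counting step, eqs. (4)-(5).

    soc   : state of charge [-]
    p_bsg : BSG electrical power [W] (>0 in traction, <0 in regeneration)
    dt    : time discretization [s] (1 s in the paper)
    c_nom : nominal capacity [As] (assumed value)
    i_aux : auxiliary current bias Ia [A] (assumed value)
    """
    v_oc = np.interp(soc, SOC_GRID, VOC_GRID)
    r0 = np.interp(soc, SOC_GRID, R0_GRID)
    # Battery current from the zero-th order equivalent circuit, eq. (4)
    i_batt = (v_oc - np.sqrt(v_oc**2 - 4.0 * r0 * p_bsg)) / (2.0 * r0)
    # Coulomb counting, eq. (5)
    soc_next = soc - dt / c_nom * (i_batt + i_aux)
    return soc_next, i_batt

soc_next, i_batt = battery_step(soc=0.5, p_bsg=3000.0)  # 3 kW of torque assist
print(f"SoC -> {soc_next:.4f}, current = {i_batt:.1f} A")
```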
4) Torque Converter Model: A simplified torque converter model is developed to compute the losses during traction and regeneration modes. Here, the lock-up clutch is assumed to be always actuated, applying a controlled slip ωslip between the turbine and the pump. The assumption might be inaccurate during launches, and this can be compensated by including a fuel consumption penalty in the optimization problem, associated to each vehicle launch event. This model is described as follows [25]:

Ttc,t = Tpt,t, (6)

ωp,t = ωtc,t + ωslip(ng,t, ωeng,t, Teng,t), (7)

ωeng,t = ωp,t if ωp,t ≥ ωstall; ωeng,t = ωidle,t if 0 ≤ ωp,t < ωstall; ωeng,t = 0 if 0 ≤ ωp,t < ωstall and Stop = 1, (8)

where ng is the gear number, ωp,t is the speed of the torque converter pump, ωtc,t is the speed of the turbine, ωstall is the speed at which the engine stalls, ωidle,t is the idle speed of the engine, Stop is a flag from the ECU indicating engine shut-off when the vehicle is stationary, Ttc,t is the turbine torque, and Tpt,t is the combined powertrain torque. The desired slip ωslip is determined based on the powertrain conditions and desired operating mode of the engine (traction or deceleration fuel cut-off).

5) Transmission Model: The transmission model is based on a static gearbox, whose equations are as follows:

ωtc,t = τg(ng,t) ωtrans,t = τg(ng,t) τfdr ωout,t = τg(ng,t) τfdr vveh,t / Rw, (9)

Ttrans,t = τg(ng,t) Ttc,t, (10)

Tout,t = τfdr ηtrans(ng,t, Ttrans,t, ωtrans,t) Ttrans,t if Ttrans,t ≥ 0, and Tout,t = τfdr Ttrans,t / ηtrans(ng,t, Ttrans,t, ωtrans,t) if Ttrans,t < 0, (11)

where τg and τfdr are the gear ratio and the final drive ratio, respectively. The transmission efficiency ηtrans(ng, Ttrans, ωtrans) is scheduled as a nonlinear map expressed as a function of gear number ng, transmission input shaft torque Ttrans,t and transmission input speed ωtrans. ωout,t refers to the angular velocity of the wheels. Rw and vveh,t are the radius of the vehicle wheel and the longitudinal velocity of the vehicle, respectively.

6) Vehicle Longitudinal Dynamics Model: The vehicle dynamics model is based on the road-load equation, which accounts for the tire rolling resistance, road grade, and aerodynamic drag:

aveh,t = (Tout,t − Tbrk,t) / (M Rw) − (1/2)(Cd ρa Af / M) vveh,t² − g Cr cos(αt) vveh,t − g sin(αt). (12)

Here, aveh,t is the longitudinal acceleration of the vehicle, Tbrk is the brake torque applied on the wheel, M is the mass of the vehicle, Cd is the aerodynamic drag coefficient, ρa is the air density, Af is the effective aerodynamic frontal area, Cr is the rolling resistance coefficient, and αt is the road grade.
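The road-load update in (12) reduces to a few lines of Python. The sketch below mirrors the equation exactly as written above; all vehicle parameters are illustrative placeholders rather than the calibrated values of the 48 V mHEV.

```python
import numpy as np

def longitudinal_accel(v_veh, t_out, t_brk, alpha,
                       mass=1800.0, r_w=0.33, c_d=0.30, rho_a=1.2,
                       a_f=2.2, c_r=0.009, g=9.81):
    """Road-load equation (12), terms in the same order as the text.

    v_veh : longitudinal velocity [m/s]   t_out : wheel torque from powertrain [Nm]
    t_brk : mechanical brake torque [Nm]  alpha : road grade [rad]
    All vehicle parameters are placeholders, not the calibrated mHEV values.
    """
    return ((t_out - t_brk) / (mass * r_w)
            - 0.5 * c_d * rho_a * a_f / mass * v_veh**2
            - g * c_r * np.cos(alpha) * v_veh
            - g * np.sin(alpha))

# One forward-Euler step at the 1 s discretization of the environment model
v = 15.0
a = longitudinal_accel(v, t_out=600.0, t_brk=0.0, alpha=0.0)
v_next = max(v + 1.0 * a, 0.0)
print(f"a = {a:.2f} m/s^2, v_next = {v_next:.2f} m/s")
```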
7) Vehicle Model Verification: The forward model is then calibrated and verified using experimental data from chassis dynamometer testing. The key variables used for evaluating the model are vehicle velocity, battery SoC, gear number, engine speed, desired engine and BSG torque profiles, and fuel consumption. Fig. 3 shows sample results from model verification over the Federal Test Procedure (FTP) regulatory drive cycle, where the battery SoC and fuel consumption are compared against experimental data.

Fig. 3. Validation of Vehicle Velocity, SoC and Fuel Consumed over FTP Cycle.

The mismatches in the battery SoC profiles can be attributed to the simplicity of the battery model, in which the electrical accessory loads are modeled using a constant current bias. The fuel consumption over the FTP cycle is well estimated by the model, with an error on the final value less than 4% relative to the experimental data.

B. Traffic Model

A large-scale microscopic traffic simulator is developed in the open source software Simulation of Urban Mobility (SUMO) [26] as part of the environment. To recreate realistic mixed urban and highway trips for training, the map of the city of Columbus, OH, USA is downloaded from the online database OpenStreetMap [27]. The map contains the length, shape, type and speed limit of the road segments and the detailed program of each traffic light in signalized intersections.

Fig. 4. Map of Columbus, OH for DRL Training [Each red and blue marker denotes the start and end point of an individual trip and the colored line denotes the route between these points.].

Fig. 4 highlights the area (∼11 km by 11 km) covered in the study. In the area, 10,000 random passenger car trips are generated as the training set, and the total distance of each trip is randomly distributed from 5 km to 10 km. Another 100 trips, with which the origins and the destinations are marked in red and blue in Fig. 4, respectively, are generated following the same distribution as the testing set. In addition, the inter-departure time of each trip follows a geometric distribution with the success rate p = 0.01. The variation and the randomness of the trips used for training enhance the richness of the environment, which subsequently leads to a learned policy that is less subject to local minima and agnostic to specific driving conditions (better generalizability) [28].

The interface between the traffic simulator and the VD&PT model is established via the Traffic Control Interface (TraCI) as part of the SUMO package. At any given time step, the kinetics of the vehicle calculated from the VD&PT model is fed to the traffic simulator as input. Subsequently, SUMO determines the location of the ego vehicle, updates the connectivity information such as the SPaT of the upcoming traffic light and the GPS signal, and returns them to the agent as part of the observations.
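A minimal sketch of this co-simulation loop through TraCI is given below. It assumes a hypothetical SUMO configuration file and ego vehicle id (already inserted in the simulation) and replaces the VD&PT model with a trivial placeholder; the intent is only to show how an externally computed ego speed is injected with setSpeed and how SPaT-like information is read back each step.

```python
import traci  # Python API distributed with SUMO

SUMO_CMD = ["sumo", "-c", "columbus.sumocfg"]  # hypothetical SUMO configuration
EGO_ID = "ego"                                 # hypothetical ego vehicle id

def upcoming_tls(veh_id, dsrc_range=200.0):
    """Distance and phase of the next traffic light, if it is within DSRC range."""
    next_tls = traci.vehicle.getNextTLS(veh_id)  # [(tls_id, link_index, distance, state), ...]
    if next_tls:
        _, _, dist, state = next_tls[0]
        if dist <= dsrc_range:
            return dist, state
    return None, None

traci.start(SUMO_CMD)
v_ego = 0.0
for step in range(3600):
    # 1) Ego kinetics from the VD&PT model (trivial placeholder acceleration here).
    v_ego = min(v_ego + 0.5, 15.0)
    traci.vehicle.setSpeed(EGO_ID, v_ego)  # override SUMO's own car-following model
    # 2) Advance the microscopic traffic simulation by one 1 s step.
    traci.simulationStep()
    # 3) Read back connectivity information for the agent's observation.
    dist_tl, phase = upcoming_tls(EGO_ID)
    distance_travelled = traci.vehicle.getDistance(EGO_ID)
traci.close()
```

In the actual environment, the placeholder in step 1 is replaced by the VD&PT model driven by the agent's torque commands.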

III. DEEP REINFORCEMENT LEARNING PRELIMINARIES

A. Markov Decision Process

In a Markov Decision Process (MDP), sequential decisions are made in order to maximize the discounted sum of the rewards. An MDP can be defined by a tuple ⟨S, A, P, ρ0, r, γ⟩, where S and A are the state space and the action space, respectively; P : S × A × S → [0, 1] is the transition dynamics distribution; ρ0 is the initial distribution of the state space. The reward function r : S × A × S → R is a function that maps the tuple (st, at, st+1) to the instantaneous reward. Finally, γ is the discount factor that prioritizes the immediate reward and ensures that the summation over the infinite horizon is finite.

Let π : S × A → [0, 1] be a randomized policy and Π be the set of all randomized policies. The objective of the MDP is to find the optimal policy π* that maximizes the expectation of the discounted sum of the rewards defined as follows:

π* = argmax_{π∈Π} η(π), where η(π) = E_{st+1∼P(·|st,at)} [ Σ_{t=0}^{∞} γ^t r(st, at) ], s0 ∼ ρ0(·), at ∼ π(·|st). (13)

In the remaining work, the expectation under the state trajectory E_{st+1∼P(·|st,at)}[·] will be written compactly as Eπ[·]. For any policy π, the value function V^π : S → R, the Q function Q^π : S × A → R and the advantage function A^π : S × A → R are defined as follows:

V^π(st) = Eπ [ Σ_{i=t}^{∞} γ^{i−t} r(si, ai) | st ], (14)

Q^π(st, at) = Eπ [ Σ_{i=t}^{∞} γ^{i−t} r(si, ai) | st, at ], (15)

A^π(st, at) = Q^π(st, at) − V^π(st). (16)
 C. Partially Observable Markov Decision Process

B. Actor-Critic Algorithm

The actor-critic algorithm is one of the earliest concepts in the field of reinforcement learning [29], [30]. The actor is typically a stochastic control policy, whereas the critic is a value function assisting the actor to improve the policy. For DRL, both the actor and the critic are typically in the form of deep neural networks.

In this study, the policy gradient method is used to iteratively improve the policy. According to the Policy Gradient Theorem in [31], [32], the gradient of the policy, parameterized by θ, can be determined by the following equation:

∇θ η(πθk) ∝ Σ_s ρ(s) Σ_a Q^{πθk}(s, a) ∇θ πθk(a|s), (17)

where ρ(s) is the discounted on-policy state distribution defined as follows:

ρ(s) = Σ_{t=0}^{∞} γ^t P(st = s). (18)

As in [33], the gradient can be estimated as follows:

∇θ η(πθk) = Eπ [ (∇θ πθk(at|st) / πθk(at|st)) A^{πθk}(st, at) ]. (19)

Accordingly, to incrementally increase η(πθ), the gradient ascent rule follows

θk+1 = θk + αk ∇θ η(πθk), (20)

where αk is the learning rate. As (19) is a local estimation of the gradient at the neighborhood of the current policy, updating the policy in such a direction with a large step size could potentially lead to a large performance drop. Schulman et al. [34] proposed to constrain the difference between the probability distributions of the old and the new policy with the trust region method. Although being less brittle, the algorithm requires the analytical Hessian matrix, resulting in a high computational load and a nontrivial implementation. In this article, a first-order method proposed by Schulman et al. [35] is used. Instead of (19), a clipped surrogate objective function is defined as follows:

Lt(θ) = Eπ [ min(rt(θ), clip(rt(θ), 1 − ε, 1 + ε)) A^{πθk}(st, at) ], where rt(θ) = πθ(at|st) / πθk(at|st), and clip(x, amin, amax) = min(max(x, amin), amax). (21)

Here, the hyperparameter ε is the clipping ratio. Note that the first-order derivative of the loss function around θk, ∇θ Lt(θ)|θk, is equal to ∇θ η(πθ)|θk, which is consistent with the Policy Gradient Theorem.

C. Partially Observable Markov Decision Process

In many practical applications, states are not fully observable. The partial observability can arise from sources such as the need to remember the states in history, the sensor availability or noise, and unobserved variations of the plant under control [36]. Such a problem can be modeled as a POMDP where observations ot ∈ O, instead of states, are available to the agent. The observation at a certain time follows the conditional distribution given the current state, ot ∼ P(·|st).

For a POMDP, the optimal policy, in general, depends on the entire history ht = (o1, a1, o2, a2, . . ., at−1, ot), i.e. at ∼ π(·|ht). The optimal policy can be obtained by giving policies the internal memory to access the history ht. In [37], it is shown that the policy gradient theorem can be used to solve the POMDP problem with a Recurrent Neural Network (RNN) as the function approximator, i.e. Recurrent Policy Gradients. Compared to other function approximators such as the Multilayer Perceptron (MLP) or the Convolutional Neural Network (CNN), the RNN exploits the sequential property of inputs and uses internal states for memory. Specifically, Long Short-Term Memory (LSTM) [38], as a special architecture of RNN, is typically used in DRL to avoid gradient explosion and gradient vanishing, which are the well-known issues to RNN [39]. In LSTM, three types of gates are used to keep the memory cells activated for arbitrarily long. The combination of Policy Gradient and LSTM has shown excellent results in many modern DRL applications [40], [41].

In the eco-driving problem, there are trajectory constraints while approaching traffic lights and stop signs. Thus, in these situations, the ego vehicle needs to remember the states visited in the recent past. LSTM based function approximators are therefore chosen to approximate the value function and the advantage function so that the ego vehicle can use information about the past states visited to decide on its torques.

IV. PROBLEM FORMULATION

In the eco-driving problem, the objective is to minimize the weighted sum of fuel consumption and travel time between two designated locations. The optimal control problem (OCP) is formulated as follows:

min_{Teng, Tbsg, Tbrk} E [ Σ_{t=0}^{∞} (ṁfuel,t + ctime) Δt · I[dt < dtotal] ] (22a)
s.t. SoCt+1 = fbatt(vveh,t, SoCt, Teng,t, Tbsg,t, Tbrk,t) (22b)
vveh,t+1 = fveh(vveh,t, SoCt, Teng,t, Tbsg,t, Tbrk,t) (22c)
Teng^min(ωeng,t) ≤ Teng,t ≤ Teng^max(ωeng,t) (22d)
Tbsg^min(ωbsg,t) ≤ Tbsg,t ≤ Tbsg^max(ωbsg,t) (22e)
0 ≤ Tbrk,t ≤ Tbrk^max (22f)
I^min ≤ It ≤ I^max (22g)
{[I(Teng,t > 0) OR I(Tbsg,t > 0)] AND I(Tbrk,t > 0)} = 0 (22h)
SoC^min ≤ SoCt ≤ SoC^max (22i)
SoCT ≥ SoCF (22j)
0 ≤ vveh,t ≤ vlim,t (22k)
(t, dt) ∉ Sred (22l)

Here, ṁfuel,t is the instantaneous fuel consumption at time t, ctime is a constant that penalizes the travel time taken at each step, and fbatt and fveh are the battery and vehicle dynamics introduced in Section II-A. The problem is formulated as an infinite horizon problem in which the stage cost becomes zero once the system reaches the goal, i.e. the traveled distance dt is greater than or equal to the total distance of the trip dtotal. (22d) to (22h) are the constraints imposed by the powertrain components. Specifically, (22h) represents that the torques from the powertrain and the brake torque cannot be positive at the same time. (22i) and (22j) are the constraints on the instantaneous battery SoC and the terminal SoC for charge sustaining. Here, the subscript T represents the time at which the vehicle reaches the destination. SoC^min, SoC^max and SoCF are commonly set to 30%, 80% and 50%. (22k) and (22l) are the constraints imposed by the traffic conditions. The set Sred represents the set in which the traffic light at a certain location is in the red phase.

As the controller can only accurately predict the future driving condition in a relatively short range due to the limited connectivity range and onboard processing power, the stochastic optimal control formulation is deployed to accommodate the future uncertainties. Specifically, since the surrounding vehicles are not considered in the study, the main source of uncertainty comes from the unknown SPaT and the road conditions, such as the speed limits and the distance between signalized intersections, beyond the connectivity range.

V. DEEP REINFORCEMENT LEARNING CONTROLLER

A. POMDP Adoption

In this study, the eco-driving problem described by (22) is solved as a POMDP. The constraints on the action space, i.e. (22d), (22e) and (22g), are handled implicitly by a saturation function in the environment model, whereas the constraints on the state space are handled by imposing numerical penalties during the offline training.

TABLE I
OBSERVATION AND ACTION SPACE OF THE ECO-DRIVING PROBLEM

Table I lists the observation and action spaces used to approach the eco-driving POMDP. Here, SoC and vveh are the states measured by the onboard sensors, and vlim, dtlc, dlim and drem are assumed to be provided by the downloaded map and GPS. ts and te are the SPaT signals provided by V2I communication. When the upcoming traffic light is in the green phase, ts remains 0, and te is the remaining time of the current green phase; when the upcoming traffic light is in the red phase, ts is the remaining time of the current red phase, and te is the sum of the remaining red phase and the duration of the upcoming green phase.
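Since Table I itself is not reproduced here, the sketch below assembles an observation vector from the quantities named in the text (SoC, ego speed, speed limit, distance to the upcoming traffic light, the SPaT times ts and te, and the route distances). The ordering and the normalization constants are illustrative assumptions, not the exact design of Table I.

```python
import numpy as np

def build_observation(soc, v_veh, v_lim, d_tlc, t_s, t_e, d_rem, d_lim):
    """Assemble the POMDP observation described in Section V-A / Table I.

    The exact ordering and scaling used in the paper are not given in the text,
    so the normalization constants below are illustrative assumptions.
    """
    obs = np.array([
        soc,                  # battery state of charge [-]
        v_veh / 40.0,         # ego velocity, scaled by an assumed maximum speed [m/s]
        v_lim / 40.0,         # speed limit from the map
        d_tlc / 200.0,        # distance to the upcoming traffic light (DSRC range)
        t_s / 60.0,           # time until the next usable green window starts (0 if green)
        t_e / 60.0,           # time until that green window ends
        d_rem / 10000.0,      # remaining trip distance from the GPS/navigation app
        d_lim / 200.0,        # d_lim: further route-related distance (interpretation assumed)
    ], dtype=np.float32)
    return obs

obs = build_observation(soc=0.55, v_veh=12.0, v_lim=15.0, d_tlc=80.0,
                        t_s=0.0, t_e=12.0, d_rem=4200.0, d_lim=150.0)
print(obs.shape, obs)
```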
Since the actor-critic algorithm is used in the article, the action space is continuous. The constraints in (22d) to (22f) are incorporated as saturation in the environment. Since the violation of the constraint in (22h) leads to a suboptimal strategy, i.e., the torques generated from the powertrain are wasted by the positive brake torque, no explicit constraint is applied w.r.t. (22h).


Fig. 5. State Machine for the Indicator mtfc.

The reward function r : S × A × S → R consists of four terms:

r = clip[robj + rvel + rbatt + rtfc, −1, 1]. (23)

Here, robj represents the reward associated with the OCP objective; rvel is the penalty (negative reward) associated with the violation of the speed limit constraint (22k); rbatt represents the penalties associated with the violation of the battery constraints imposed by (22i), (22j); rtfc is the penalty regarding the violation of the traffic light constraint imposed by (22l). Specifically, the first three terms are designed as follows:

robj,t = cobj (ṁfuel,t [g/s] + ctime), (24)

rvel,t = cvel,1 [vveh,t − vlim,t]+ + cvel,2 ȧveh,t², (25)

rbatt,t = cbatt,1 ([SoCt − SoC^max]+ + [SoC^min − SoCt]+) if dt < dtotal, and rbatt,t = cbatt,2 [SoCF − SoCt]+ if dt ≥ dtotal, (26)

where [·]+ is the positive part of the variable defined as

[x]+ = max(0, x). (27)

In (25), a penalty on the longitudinal jerk is assigned to the agent to improve the drive quality and to avoid unnecessary speed fluctuations.
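The first three reward terms and the final clipping of (23) can be sketched as follows. All coefficient values below are placeholders chosen only to make the example run; the calibrated values are those of Table III.

```python
import numpy as np

def positive_part(x):
    """[x]+ = max(0, x), eq. (27)."""
    return max(0.0, x)

def reward(m_fuel, v_veh, v_lim, jerk, soc, d_t, d_total,
           c_obj=-0.01, c_time=1.0, c_vel1=-0.1, c_vel2=-0.01,
           c_batt1=-1.0, c_batt2=-1.0, r_tfc=0.0,
           soc_min=0.30, soc_max=0.80, soc_f=0.50):
    """Sketch of eqs. (23)-(27); coefficients are placeholders, not Table III values."""
    # Objective term (24): fuel flow [g/s] plus the travel-time constant
    r_obj = c_obj * (m_fuel + c_time)
    # Speed-limit violation and longitudinal-jerk penalty (25)
    r_vel = c_vel1 * positive_part(v_veh - v_lim) + c_vel2 * jerk**2
    # Battery constraint penalties (26)
    if d_t < d_total:
        r_batt = c_batt1 * (positive_part(soc - soc_max) + positive_part(soc_min - soc))
    else:
        r_batt = c_batt2 * positive_part(soc_f - soc)
    # Total reward clipped to [-1, 1], eq. (23)
    return float(np.clip(r_obj + r_vel + r_batt + r_tfc, -1.0, 1.0))

print(reward(m_fuel=1.2, v_veh=16.0, v_lim=15.0, jerk=0.5,
             soc=0.45, d_t=500.0, d_total=8000.0))
```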
While the design of the first three rewards is straightforward, the reward associated with the traffic light constraints is more convoluted to define. First, a discrete state variable mtfc is defined in Fig. 5. mtfc = 0 whenever the distance to the upcoming traffic light is greater than the critical braking distance dcritical,t, which is defined as follows:

dcritical,t = vveh,t² / (2 bmax), (28)

where bmax is the maximal deceleration of the vehicle. Intuitively, the agent does not need to make an immediate decision regarding whether to accelerate to pass or to decelerate to stop at the upcoming signalized intersection outside the critical braking distance range. Once the vehicle is within the critical braking distance range, mtfc is determined by the current status of the upcoming traffic light, the distance between the vehicle and the intersection, and the maximal distance that the vehicle could drive given the remaining green phase dmax,t, defined as follows:

dmax,t = Σ_{i=0}^{te} [min(vlim,t, vveh,t + i amax)], (29)

where amax is the maximal acceleration.

If the upcoming traffic light is in green, i.e. ts = 0, mtfc gets updated to 1 if the distance between the vehicle and the upcoming intersection is less than dmax,t and to 2 otherwise. Intuitively, mtfc = 1 means that the vehicle has enough time to cross the upcoming intersection within the remaining green phase in the current traffic light programming cycle. In case the vehicle following the actor policy was not able to catch the green light, a penalty proportional to dmiss, the distance between the vehicle at the last second of the green phase and the intersection, is assigned. On the other hand, mtfc = 2 means the vehicle would not reach the intersection even with the highest acceleration in the current cycle. If the upcoming traffic light is in the red phase, mtfc,t gets updated to 3. When the vehicle is not able to come to a stop and violates the constraint, a penalty proportional to the speed at which the vehicle passes in red is assigned. As a summary, the reward associated with the traffic light constraints is designed as follows:

rtfc,t = ctfc,1 + ctfc,2 dmiss,t if mtfc,t = 1, and rtfc,t = ctfc,1 + ctfc,3 vveh,t otherwise. (30)

In Appendix A, a guideline to the design and the tuning of the reward mechanism is provided, and the numerical values of all the constants in the reward function are listed in Table III. In order to determine the reward at any given time, the environment model requires states that are not directly available as observations, such as ȧt, mtfc and dmiss. Instead of making these states available to the control agent, the POMDP formulation is intentionally selected for two reasons. Firstly, mtfc is heuristically determined and ignores the acceleration limit imposed by the powertrain components. Since it poses a significant impact on the reward at the intersection, revealing mtfc to the controller results in a strong dependency of the strategy on it, and occasionally such a dependency misleads the agent to violate the constraints. Secondly, dmiss is only relevant when the vehicle is catching a green light within the critical braking distance. Its numerical value in other situations could potentially mislead the policy, in the form of neural networks.
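The indicator logic around (28)-(30) can be sketched as follows. The deceleration and acceleration limits and the reward coefficients are assumed values, and the red-phase branch (mtfc = 3) is only noted in a comment.

```python
def d_critical(v_veh, b_max=3.0):
    """Critical braking distance, eq. (28); b_max is an assumed deceleration limit [m/s^2]."""
    return v_veh**2 / (2.0 * b_max)

def d_max(v_veh, v_lim, t_e, a_max=2.0):
    """Maximal distance reachable within the remaining green time t_e, eq. (29)."""
    return sum(min(v_lim, v_veh + i * a_max) for i in range(int(t_e) + 1))

def traffic_light_penalty(m_tfc, d_miss, v_veh,
                          c_tfc1=-0.5, c_tfc2=-0.0067, c_tfc3=-0.067):
    """Penalty of eq. (30); coefficients are placeholders, not the Table III values."""
    if m_tfc == 1:
        return c_tfc1 + c_tfc2 * d_miss
    return c_tfc1 + c_tfc3 * v_veh

# Example: green light 30 m ahead, 3 s of green left, ego at 15 m/s with a 15 m/s limit.
v, dist_tl, t_e, v_lim = 15.0, 30.0, 3.0, 15.0
if dist_tl > d_critical(v):
    m_tfc = 0   # outside the critical braking distance: no immediate decision needed
elif dist_tl < d_max(v, v_lim, t_e):
    m_tfc = 1   # can clear the intersection within the remaining green phase
else:
    m_tfc = 2   # cannot clear it even at maximal acceleration
# (A red upcoming light would set m_tfc = 3; that branch is omitted here.)
print(m_tfc, d_critical(v), d_max(v, v_lim, t_e))
```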
Studies have suggested that clipping the rewards between [−1, 1] results in a better overall training process [42], [43]. With the coefficients listed in Table III, the negative reward saturates at −1 when dmiss,t > 75 m and mtfc,t = 1, or when vveh,t > 7.5 m/s and mtfc,t = 2, which means that the rewarding mechanism would no longer differentiate the quality of the state-action pairs beyond these thresholds at the signalized intersection. Such a design, on one hand, reduces the strong impact of the heuristically designed mtfc. On the other hand, it also significantly slows down, or in some cases prevents, any learning, as the rewards carry little directional information. Heess et al. [28] propose to use a more diversified and rich environment to overcome the issue. In this study, the diversity of the environment is ensured by the size of the SUMO map and the 10,000 randomly generated trips. In addition, the vehicle speed vveh and battery SoC are randomly assigned following U(0, vlim) and U(SoC^min, SoC^max), respectively, at every T = 100 time steps. This domain randomization mechanism, used in many other DRL applications [41], [44], forces the agent to explore the state space more efficiently and learn a more robust policy.
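This domain randomization amounts to a periodic resampling of the ego speed and battery SoC, as in the minimal sketch below; the SoC bounds follow Section IV and the reset period T = 100 follows the text, while the speed limit value is only an example.

```python
import numpy as np

RNG = np.random.default_rng(0)
SOC_MIN, SOC_MAX = 0.30, 0.80
RANDOMIZATION_PERIOD = 100  # steps, T = 100 in the text

def maybe_randomize(step, v_veh, soc, v_lim):
    """Resample ego speed and SoC every T steps: v ~ U(0, v_lim), SoC ~ U(SoC_min, SoC_max)."""
    if step % RANDOMIZATION_PERIOD == 0:
        v_veh = RNG.uniform(0.0, v_lim)
        soc = RNG.uniform(SOC_MIN, SOC_MAX)
    return v_veh, soc

v, soc = 10.0, 0.50
for step in range(1, 301):
    v, soc = maybe_randomize(step, v, soc, v_lim=15.0)
    # ... simulate one environment step with the (possibly resampled) state ...
print(v, soc)
```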
B. Algorithm Details

With the Monte Carlo method, the value function can be approximated as follows:

V̂ξ^{πθ}(st) ← Êπθ [ Σ_{i=t}^{∞} γ^{i−t} r(si, ai) | st ], (31)

where the superscript πθ indicates that the value function is associated with the policy π parameterized by θ, and the subscript ξ indicates that the value function itself is parameterized by ξ. Although being unbiased, the Monte Carlo estimator is of high variance and requires the entire trajectory to be simulated. On the other hand, the TD(N) estimator is defined as follows:

V̂ξ^{πθ}(st) ← Êπθ [ Σ_{i=t}^{t+N} γ^{i−t} r(si, ai) + V̂ξ,old^{πθ}(st+N) | st ]. (32)

Compared to the Monte Carlo method, it reduces the required rollout length and the variance of the estimation by bootstrapping. However, the TD(N) estimator is biased due to the approximation error in V^π(st+N). TD(λ), included in [30], takes the geometric sum of the terms from TD(N), leading to an adjustable balance between bias and variance. In this study, LSTM is used as the function approximator; instead of the data tuple (st, at, rt, st+1), the tuples (ot, ho,t, at, ha,t, rt, ot+1, ho,t+1) are logged in simulation, where ho and ha are the hidden states of the policy and value function networks, respectively. Since the state space is randomized at every N steps, truncated TD(λ) is used for the value function approximation. Specifically, after having collected a sequence of tuples (ot, ho,t, at, ha,t, rt, ot+1, ho,t+1)_{t0:t0+N}, the following equations are used for updating the value function ∀t ∈ [t0, t0 + N − 1]:

V̂ξ^{πθ}(ot, ho,t) ← Vξ^{πθ}(ot, ho,t) + Σ_{i=t}^{t0+N−1} (γλV)^{i−t} δi, (33)

where δi = ri + γ Vξ^{πθ}(oi+1, ho,i+1) − Vξ^{πθ}(oi, ho,i). (34)

Similarly, to balance the variance and the bias, the advantage function is estimated with truncated GAE(λ) as proposed in [45]:

Â^{πθ}(ot, ho,t, at) = Σ_{i=t}^{t0+N−1} (γλA)^{i−t} δi. (35)
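The truncated TD(λ)/GAE(λ) updates in (33)-(35) reduce to a single backward pass over each logged segment. The sketch below uses one λ for both the value target and the advantage, whereas the paper allows separate λV and λA; γ and λ are assumed values.

```python
import numpy as np

def truncated_gae(rewards, values, value_next, gamma=0.99, lam=0.95):
    """Truncated GAE(lambda) over one logged segment, eqs. (33)-(35).

    rewards    : r_t for t = t0 ... t0+N-1
    values     : V(o_t, h_t) for the same steps (from the critic)
    value_next : bootstrap value V(o_{t0+N}, h_{t0+N})
    """
    n = len(rewards)
    values_ext = np.append(values, value_next)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]   # eq. (34)
    advantages = np.zeros(n)
    running = 0.0
    for t in reversed(range(n)):                                  # eq. (35)
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    value_targets = advantages + values   # TD(lambda) target of eq. (33) when lamV = lamA
    return advantages, value_targets

r = np.array([0.1, -0.2, 0.05, 0.0])
v = np.array([1.0, 0.9, 0.8, 0.7])
adv, targets = truncated_gae(r, v, value_next=0.65)
print(adv, targets)
```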
Fig. 6. Network Architecture.

Algorithm 1: PPO for Eco-Driving.

Fig. 6 shows the architectures of the neural network function approximators for the value function and the policy. Here, a multivariate Gaussian distribution with a diagonal covariance matrix is used as the stochastic policy. With the estimated advantage function and the policy update rule in (21), the policy tends to converge to a suboptimal deterministic policy prematurely, since a sequence of actions is required to change the intersection-crossing behavior due to its hierarchical nature. Studies [46], [47] show that adding the entropy of the stochastic policy to the surrogate objective function effectively prevents such premature convergence. As a result, the policy is updated by maximizing the following objective function:

Lmod,t(θ) = Lt(θ) + β h(N(μ, Σ)). (36)

Here, Lt(θ) is the surrogate objective function defined in (21). β and h(N(μ, Σ)) are the entropy coefficient and the entropy of the multivariate Gaussian policy.
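Combining the clipped surrogate with the entropy bonus of (36) for a diagonal Gaussian policy can be sketched as follows; β, the clipping ratio and the three-dimensional action (one entry per torque command) are assumptions, not the settings listed in Appendix B.

```python
import torch
from torch.distributions import Independent, Normal

def ppo_entropy_loss(mean, log_std, actions, logp_old, advantages,
                     clip_eps=0.2, beta=0.01):
    """Surrogate objective of (21) plus the entropy bonus of (36), as a loss to minimize."""
    dist = Independent(Normal(mean, log_std.exp()), 1)   # diagonal multivariate Gaussian
    logp_new = dist.log_prob(actions)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    entropy = dist.entropy().mean()          # h(N(mu, Sigma)) for the diagonal Gaussian
    return -(surrogate + beta * entropy)     # maximize L_mod  <=>  minimize its negative

# Toy batch: 3-dimensional action (engine, BSG and brake torque commands, assumed)
mean = torch.zeros(64, 3, requires_grad=True)
log_std = torch.zeros(64, 3, requires_grad=True)
actions = torch.randn(64, 3)
logp_old = Independent(Normal(torch.zeros(64, 3), torch.ones(64, 3)), 1).log_prob(actions)
advantages = torch.randn(64)
loss = ppo_entropy_loss(mean, log_std, actions, logp_old.detach(), advantages)
loss.backward()
```

In practice the entropy coefficient β is annealed over training, which is consistent with the decreasing policy entropy reported in Section VII.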


Since PPO is on-policy, its sample efficiency is low. To accelerate the learning progress, multiple actors are distributed over different CPU processors for sample collection. Algorithm 1 lists the detailed steps of the algorithm, and Appendix B lists all the hyperparameters used for the training.

VI. METHODS FOR BENCHMARKING

To benchmark the fuel economy benefits of CAV technologies, it is crucial to establish a baseline representative of real-world driving. In this work, the performance of the DRL agent is benchmarked against two other real-time control strategies and the wait-and-see solution.

A. Baseline Controller

The baseline controller consists of the Enhanced Driver Model (EDM), a deterministic reference velocity predictor that utilizes route features to generate velocity profiles representing varying driving styles [48], [49], and a rule-based HEV energy management controller. The baseline strategy passes signalized intersections based on line-of-sight (LoS), a dynamic human-vision based distance parameter used to preview the upcoming route feature as devised by the Intersection Sight Distance (ISD) specified by the American Association of State Highway and Transportation Officials (AASHTO) and US DoT FHA [50].

B. Online Optimal Controller

In the previous work [13], a hierarchical MPC is formulated to co-optimize the velocity and powertrain controls with an aim to minimize energy in a 48 V mild-HEV using Approximate Dynamic Programming (ADP). The controller solves a long-term optimization at the beginning of the trip that evaluates a base policy using limited full-route information such as speed limits and positions of route markers such as stop signs and traffic lights.
To account for variability in route conditions and/or uncertainty in route information, a short-term optimization is solved periodically over a shorter horizon using the approximated terminal cost from the long-term optimization. Time-varying SPaT information is accounted for by developing a heuristic controller that uses the distance to the traffic light and the current vehicle velocity to kinematically reshape the speed limit such that the vehicle can pass at a green light. The formulation and details of the algorithm are referred to [13]. A detailed analysis is done to demonstrate the real-time implementation of the developed controller in [14] at a test track in Columbus, OH.

C. Deterministic Optimal Solution

The baseline controller, the online optimal controller, and the DRL agent can only preview the road information for the future 200 m, which leaves the rest of the route information stochastic. To show the sub-optimality from the control strategy and the stochastic nature of the problem, the wait-and-see solution, assuming the road information and the SPaT sequence of the traffic lights over the entire route are known a priori, is computed via Deterministic Dynamic Programming (DDP) [51].

To solve the problem via DDP, the state and action spaces are discretized, and the optimal cost-to-go matrix and the optimal control policy matrix are obtained from backward recursion [9]. Since each combination in the state-action space needs to be evaluated to get the optimal solution, the calculation of the wait-and-see solution for any trip used in the study can take hours with modern Central Processing Units (CPUs). Here, the parallel DDP solver in [52] with CUDA programming is used to reduce the computation time from hours to seconds. The formulation and the details of the algorithm are referred to [52]. Nevertheless, as the method requires the entire trip information a priori and intense computation, the wait-and-see solution can only serve as a benchmark instead of a real-time decision-making module.
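The backward recursion used for the wait-and-see benchmark follows the standard DDP pattern sketched below on a toy grid problem; the actual state, action and cost definitions of the full-trip eco-driving problem (and the CUDA parallelization of [52]) are not reproduced.

```python
import numpy as np

def backward_recursion(n_steps, states, actions, stage_cost, transition):
    """Generic Deterministic Dynamic Programming backward recursion.

    stage_cost(k, s, a) and transition(k, s, a) are placeholders for the
    full-trip model with known SPaT; here states/actions are grid indices.
    """
    n_s, n_a = len(states), len(actions)
    cost_to_go = np.zeros((n_steps + 1, n_s))        # terminal cost assumed zero
    policy = np.zeros((n_steps, n_s), dtype=int)
    for k in range(n_steps - 1, -1, -1):             # backward in time
        for i_s in range(n_s):
            q = np.empty(n_a)
            for i_a in range(n_a):
                i_next = transition(k, i_s, i_a)
                q[i_a] = stage_cost(k, i_s, i_a) + cost_to_go[k + 1, i_next]
            policy[k, i_s] = int(np.argmin(q))
            cost_to_go[k, i_s] = q[policy[k, i_s]]
    return cost_to_go, policy

# Tiny illustrative problem: drive the (scalar) state index toward 0 at low cost.
states, actions = list(range(5)), [-1, 0, 1]
transition = lambda k, s, a: int(np.clip(s + actions[a], 0, 4))
stage_cost = lambda k, s, a: float(s) + 0.1 * abs(actions[a])
J, pi = backward_recursion(n_steps=10, states=states, actions=actions,
                           stage_cost=stage_cost, transition=transition)
print(J[0])
```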


Fig. 7. Evolution of Policy Entropy, Cumulative Rewards, Fuel Economy, Average Speed and Complete Ratio during Training.

VII. RESULTS

The training takes place on a node of the Ohio Supercomputer Center (OSC) [53] with dual Intel Xeon 6148 processors (40 cores) and an NVIDIA Volta V100 GPU. The results shown below require running the training continuously for 24 hours.

As domain randomization is activated during training, the performance of the agent needs to be evaluated separately from the training to show the learning progress. Here, 40 trips in the training set with domain randomization deactivated are executed for every 10 policy updates, i.e. every 4,000 rollout episodes. In Fig. 7, the blue curves show the evolution of the mean policy entropy, the average cumulative rewards, the average fuel economy, the average speed and the complete ratio over the 40 randomly selected trips, and the red curves indicate the running average over 10 evaluation points. Here, a trip is considered completed if the agent did not violate any constraint during the itinerary. At the beginning of the training, the policy entropy increases to encourage exploration, and as the training evolves, the policy entropy eventually decreases as the entropy coefficient β is linearly annealed. On average, the agent reaches a performance with an average fuel economy of 41.0 mpg (miles per gallon) and an average speed of ∼12.7 m/s. The agent was able to learn to obey the traffic rules at signalized intersections within the operation design domain thanks to the properly designed rewarding mechanism. Although the use of negative rewards leads to the lack of safety guarantees, this does not pose a concern for an implementation since the handling of the long-tailed corner cases is usually deferred to a downstream controller module, see for instance [6]. In addition, studies [54], [55] show that safety filters can be deployed on top of the superior planner to provide a safety guarantee.

Fig. 8. Fuel Economy, Travel Time Comparison and Charge Sustenance Behavior for Baseline, DRL and Wait-and-see Solutions.

TABLE II
FUEL ECONOMY, AVERAGE SPEED AND SOC VARIANCE FOR BASELINE, ADP, DRL AND WAIT-AND-SEE SOLUTIONS

Fig. 9. Variation of Average Speed and Fuel Economy against Traffic Light (TL) Density for Baseline, DRL and Wait-and-see Solutions.

Fig. 10. Comparison of Velocity, ΔSoC, Time-Space and Fuel Consumption for Baseline, DRL and Wait-and-see Solutions [Urban Route].

The performance of the DRL controller is then compared against the causal baseline and ADP controllers and the non-causal wait-and-see deterministic optimal solution among the 100 testing trips shown in Fig. 4. Fig. 8 and Table II show the statistical comparison among the four strategies. Here, the black line in each box represents the median of the data, and the lower and upper boundaries represent the 25th and 75th percentiles, respectively. The extremes of the lines extending outside the box represent the minimum and maximum limits of the data, and the "+" symbol represents the outliers. With a comparable travel time, the DRL controller consumes 17.5% less fuel over all trips compared to the baseline strategy. The benefits come from the more efficient use of the HEV powertrain through more energy recuperation into the battery during braking, less use of the mechanical brake, and less unnecessary acceleration due to the presence of traffic lights. These behaviors are demonstrated later in Figs. 10 and 11. Compared to the ADP controller, the DRL consumes 3.7% less fuel, while being ∼1 m/s slower. Meanwhile, considering that the ADP controller requires solving the full-route optimization via DDP before departure and a trajectory optimization at every timestep in real time, the DRL strategy, which requires only the forward evaluation of the policy network, is more computationally tractable.

The wait-and-see solution provides a dominant performance over the causal controllers. The additional benefits of the wait-and-see solution stem from the fact that the wait-and-see solution has the information of all the traffic lights over the entire trip, whereas the causal controllers only use the SPaT of the upcoming traffic light within 200 m. Such an advantage is also reflected in Fig. 9. Here, the average speed and the MPG of each trip are plotted against the traffic light density, i.e. the number of traffic lights divided by the total distance in kilometers. Intuitively, the average speed of the three controllers decreases as the traffic light density increases. Compared to the causal controllers, the fuel economy of the wait-and-see solution is not affected by the increase in traffic light density. This is because the additional SPaT information from the subsequent traffic lights allows the wait-and-see solution to plan through the intersections more efficiently. In the meantime, as suggested by the regression curves, the DRL controller is less sensitive to the increase in traffic light density than the baseline and ADP controllers.

Figs. 10 and 11 show the trajectories of the three controllers in urban and mixed (urban and highway) driving conditions, respectively. Driving under the same condition, the DRL controller was able to come to a full stop at signalized intersections less frequently compared to the baseline.


In addition, the DRL controller utilizes more of the battery capacity, i.e. a SoC profile with higher variation, compared to the baseline. The DRL controller's efficient maneuvers while approaching the intersection, coupled with better utilization of the SoC, result in up to 27% reduction in fuel consumption for both urban and mixed driving when compared against the baseline.

Fig. 11. Comparison of Velocity, ΔSoC, Time-Space and Fuel Consumption for Baseline, DRL and Wait-and-see Solutions [Mixed Route].

VIII. CONCLUSION

In this study, the eco-driving problem for HEVs with the capability of autonomously passing signalized intersections is formulated as a POMDP. To accommodate the complexity and high computational requirement of solving this problem, a learn-offline, execute-online strategy is proposed. To facilitate the training, a simulation environment was created consisting of a mild HEV powertrain model and a large-scale microscopic traffic simulator developed in SUMO. The DRL controller is trained via PPO with LSTM as the function approximators. The performance of the DRL controller is benchmarked against a baseline strategy, a deterministic MPC strategy and a wait-and-see (optimal) solution. With the properly designed rewarding mechanism, the agent learned to obey the constraints in the optimal control problem formulation. Furthermore, the learned explicit policy reduces the average fuel consumption by 17.5% over 100 randomly generated trips in urban, mixed-urban and highway conditions when compared to the baseline strategy, while keeping the travel time comparable.

Future work will focus on three aspects. First, the hard constraint satisfaction will be rigorously analyzed. Then, the design of a reward function specific to the individual driver (personalization) will be investigated. Finally, the presence of a lead vehicle will be considered in the framework of an ecological ACC, which can be accomplished by expanding the state space and rewarding mechanism.

APPENDIX A
REWARD FUNCTION DESIGN

In general, the design of the reward function is iterative. To get the desired behavior from the trained policy, the numerical constants in the reward function typically require tuning by humans. Here, the numerical constants are listed in Table III. Some key takeaways are listed below.
1) Normalize the scale of the reward function such that the numerical value is between [−1, 1].
2) The reward items rvel, rbatt and rtfc are associated with the constraints, and they should be at least one order of magnitude higher than robj.
3) Rewards from the environment should reflect incremental incentives/penalties. For example, rewards associated with traffic lights are sparse, meaning that the agent receives these rewards periodically and needs a sequence of actions to avoid the large penalty. Equation (30) ensures the penalty for violating the traffic condition is proportional to how bad the violation was.
4) Penalties related to constraints should be orders larger than those related to performance.

TABLE III
NUMERICAL VALUES OF THE CONSTANTS IN REWARD FUNCTION

APPENDIX B
HYPERPARAMETERS

Two sets of hyperparameters have been obtained in this study. The first set contains those in the reward function, and the second contains those for the PPO algorithm.

The values in the first set are included in Table III. In finding these values, the behavior of the agent was found to vary with ctime, as it governs the relative rewards between fuel consumption and travel time. For a fair comparison against the other methods, the parameter was tuned to match the average speed while comparing fuel consumption. As for the reward coefficients related to the constraint penalties, the final performance does not appear sensitive as long as the penalties are large enough to dominate the positive rewards gained by ignoring the constraints while remaining numerically stable.

The hyperparameters used for training are listed in Table IV. The final performance was not found to be sensitive to this set of hyperparameters. The robustness here is most likely due to the fact that PPO is arguably one of the most stable model-free DRL algorithms, and that the parallel simulation environment provides a large amount of samples efficiently.


TABLE IV [18] G. Guo, Z. Zhao, and R. Zhang, “Distributed trajectory optimization and
NUMERICAL VALUES OF THE CONSTANTS IN REWARD FUNCTION fixed-time tracking control of a group of connected vehicles,” IEEE Trans.
Zhaoxuan Zhu (Member, IEEE) received the B.Sc. (summa cum laude), M.S., and Ph.D. degrees in mechanical engineering from The Ohio State University, Columbus, OH, USA, in 2016, 2018, and 2021, respectively. From 2017 to 2021, he was a Graduate Research Associate with the Center for Automotive Research. He is currently a Senior Engineer with Motional, Boston, MA, USA. His research interests include optimal control, reinforcement learning, deep learning, and their applications to connected and autonomous vehicles.

Shobhit Gupta (Member, IEEE) received the bachelor of technology degree in mechanical engineering from the Indian Institute of Technology Guwahati, Guwahati, India, in 2017, and the M.S. and Ph.D. degrees in mechanical engineering from The Ohio State University, Columbus, OH, USA, in 2019 and 2022, respectively. He is currently a Propulsion Controls Researcher with General Motors Global Research and Development, Detroit, MI, USA. His research interests include AI/ML and optimal control applicable to battery management systems, connected and autonomous vehicles, and driver behavior recognition for predictive control.

Abhishek Gupta (Member, IEEE) received the B.Tech. degree in aerospace engineering from IIT Bombay, Mumbai, India, in 2009, the M.S. degree in applied mathematics from the University of Illinois at Urbana-Champaign (UIUC), Champaign, IL, USA, in 2012, and the M.S. and Ph.D. degrees in aerospace engineering from UIUC in 2014. He is currently an Associate Professor with the ECE Department, The Ohio State University, Columbus, OH, USA. His research interests include stochastic control theory, probability theory, and game theory with applications to transportation markets, electricity markets, and cybersecurity of control systems.

Marcello Canova (Member, IEEE) received the Ph.D. degree in mechanical engineering from the University of Parma, Parma, Italy. He is currently a Professor of mechanical and aerospace engineering with The Ohio State University, Columbus, OH, USA, and the Associate Director of the Center for Automotive Research. He is the author of more than 150 journal and conference articles and five U.S. patents. His research interests include energy optimization and management of ground vehicle propulsion systems, including internal combustion engines, hybrid-electric drivetrains, energy storage systems, and thermal management. He was the recipient of the SAE Vincent Bendix Automotive Electronics Engineering Award in 2009, the SAE Ralph E. Teetor Educational Award in 2016, the NSF CAREER Award in 2016, the Lumley Research Award in 2012, 2016, and 2020, and the Michael J. Moran Award for Excellence in Teaching in 2017.