A Deep Reinforcement Learning Framework for Eco-Driving in Connected and Automated Hybrid Electric Vehicles

Abstract—Connected and Automated Vehicles (CAVs), in particular those with multiple power sources, have the potential to significantly reduce fuel consumption and travel time in real-world driving conditions. Specifically, the eco-driving problem seeks to design optimal speed and power usage profiles, based on look-ahead information from connectivity and advanced mapping features, to minimize the fuel consumption over a given itinerary. In this work, the eco-driving problem is formulated as a Partially Observable Markov Decision Process (POMDP), which is then solved with a state-of-the-art Deep Reinforcement Learning (DRL) actor-critic algorithm, Proximal Policy Optimization (PPO). An eco-driving simulation environment is developed for training and evaluation purposes. To benchmark the performance of the DRL controller, a baseline controller representing the human driver, a trajectory optimization algorithm, and the wait-and-see deterministic optimal solution are presented. With a minimal onboard computational requirement and a comparable travel time, the DRL controller reduces the fuel consumption by more than 17% compared against the baseline controller by modulating the vehicle velocity over the route and performing energy-efficient approach and departure at signalized intersections, outperforming the more computationally demanding trajectory optimization method.

Index Terms—Connected and automated vehicle, eco-driving, deep reinforcement learning, dynamic programming, long short-term memory.

Manuscript received 31 December 2022; revised 29 August 2023; accepted 10 September 2023. Date of publication 22 September 2023; date of current version 13 February 2024. This work was supported in part by the United States Department of Energy, Advanced Research Projects Agency–Energy (ARPA-E) NEXTCAR Project under Grant DE-AR0000794 and in part by The Ohio Supercomputer Center. The review of this article was coordinated by Dr. Cailian Chen. (Corresponding author: Zhaoxuan Zhu.)

Zhaoxuan Zhu was with the Center for Automotive Research, The Ohio State University, Columbus, OH 43212 USA. He is now with Motional, Boston, MA 02210 USA (e-mail: [email protected]).

Shobhit Gupta was with the Center for Automotive Research, The Ohio State University, Columbus, OH 43212 USA. He is now with the General Motors Research and Development, Warren, MI 48092 USA (e-mail: [email protected]).

Abhishek Gupta is with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]).

Marcello Canova is with the Center for Automotive Research, The Ohio State University, Columbus, OH 43212 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVT.2023.3318552

I. INTRODUCTION

With the advancement in vehicular connectivity and autonomy, Connected and Automated Vehicles (CAVs) have the potential to operate in a more time- and fuel-efficient manner [1]. With Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication, the controller has access to real-time look-ahead information including the terrain, infrastructure and surrounding vehicles. Intuitively, with connectivity technologies, controllers can plan a speed profile that allows the ego vehicle to intelligently pass more signalized intersections in green phases with less change in speed. This problem is formulated as the eco-driving problem (incorporating Eco-Approach and Departure at signalized intersections), which aims to minimize the fuel consumption and the travel time between two designated locations by co-optimizing the speed trajectory and the powertrain control strategy [2], [3].

The literature on the eco-driving problem can be distinguished along two aspects, namely, the powertrain configuration and the traffic scenario. Regarding the powertrain configuration, the difference lies in whether the powertrain is equipped with a single power source [3], [4], [5], [6] or a hybrid electric architecture [7], [8], [9], [10]. The latter involves modeling multiple power sources and devising optimal control algorithms that can synergistically split the power demand to efficiently utilize the electric energy stored in the battery. Maamria et al. [11] systematically compare the computational requirement and the optimality of different eco-driving formulations solved offline via Deterministic Dynamic Programming (DDP).

Related to the traffic scenarios, Ozatay et al. [4] proposed a framework providing an advisory speed profile using online optimization conducted on a cloud-based server, without considering the real-time traffic light variability. Olin et al. [9] implemented the eco-driving framework to evaluate real-world fuel economy benefits obtained from a control logic computed in a Rapid Prototyping System on-board a test vehicle. As traffic lights are not explicitly considered in these studies, the eco-driving control module is required to be coupled with other decision-making agents, such as human drivers or Adaptive Cruise Control (ACC) systems. Other studies have explicitly modeled and considered Signal Phase and Timings (SPaTs). Jin et al. [3] formulated the problem as a Mixed Integer Linear Programming (MILP) problem for conventional vehicles with an Internal Combustion Engine (ICE). Asadi et al. [12] used traffic simulation models and proposed to solve the problem considering probabilistic SPaT with DDP. Sun et al. [6] formulated the eco-driving problem as a distributionally robust stochastic optimization problem with collected real-world
vehicle and its location along the trip. While the states of the vehicle and powertrain, such as battery State-of-Charge (SoC), velocity and gear, are readily available to the powertrain controller, the availability of the connectivity information depends on the infrastructure and the types of sensors equipped onboard. In this study, it is assumed that Dedicated Short Range Communication (DSRC) [12] sensors are available onboard, and SPaT becomes available and remains accurate once the vehicle is within the 200 m range. The uncertainties caused by sensor unavailability and inaccuracy in SPaT, as studied in [6], [23], are not considered in the simulation model or in this study. While adding uncertainties to the traffic model is left as future work, such uncertainties can be ingested by the Markov Decision Process (MDP) formulation; thus, the model-free DRL problem formulation is expected to remain the same. The DRL agent utilizes the SPaT from the upcoming traffic light while ignoring the SPaT from any other traffic light, regardless of availability. Specifically, the distance to the upcoming traffic light, its status and its SPaT program are fed into the controller as observations. Finally, a navigation application with Global Positioning System (GPS) is assumed to be on the vehicle such that the locations of the origin and the destination, the remaining distance, and the speed limits over the entire trip are available at every point during the trip.

A. Vehicle and Powertrain Model

A forward-looking dynamic powertrain model is developed for fuel economy evaluation and control strategy verification over real-world routes. In this work, a P0 mild-hybrid electric vehicle (mHEV) is considered, equipped with a 48 V Belted Starter Generator (BSG) performing torque assist, regenerative braking and start-stop functions. The diagram of the powertrain is illustrated in Fig. 2. The key components of the low-frequency quasi-static model are described below.

Fig. 2. Block Diagram of 48 V P0 Mild-Hybrid Drivetrain.

1) Engine Model: The engine is modeled as low-frequency quasi-static nonlinear maps. The fuel consumption and the torque limit maps are based on steady-state engine test bench data provided by a supplier:

$$\dot{m}_{fuel,t} = \psi\left(\omega_{eng,t}, T_{eng,t}\right), \qquad (1)$$

where the second subscript $t$ represents the discrete time index, and $\omega_{eng}$ and $T_{eng}$ are the engine angular velocity and torque, respectively.
available and remains accurate once it enters the 200 m range. Cnom
The uncertainties caused by sensor unavailability and inaccuracy where Δt is the time discretization, which is set to 1 s in this
in SPaT, as studied in [6], [23] in SPaT, is not considered in the study. The power consumed by the auxiliaries is modeled by a
simulation model or in the study. While adding uncertainties in calibrated constant current bias Ia . The cell open circuit voltage
the traffic model is left as future work, such uncertainties can be VOC (SoCt ) and internal resistance R0 (SoCt ) data are obtained
ingested by the Markov Decision Process (MDP) formulation. from the pack supplier.
Thus, the model-free DRL problem formulation is expected 4) Torque Converter Model: A simplified torque converter
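Equations (4) and (5) map directly into code. The sketch below assumes illustrative open-circuit-voltage and resistance curves, nominal capacity and auxiliary current bias, since the pack-supplier data are not public; only the structure of the update follows the equations above.

```python
import numpy as np

# Placeholder cell/pack data: V_OC(SoC) and R_0(SoC) would come from the pack supplier.
def v_oc(soc):            # open-circuit voltage [V]
    return 44.0 + 8.0 * soc

def r_0(soc):             # internal resistance [Ohm]
    return 0.05 + 0.01 * (1.0 - soc)

C_NOM = 28.0 * 3600.0     # nominal capacity [A*s] (illustrative 28 Ah pack)
I_A = 2.0                 # auxiliary current bias [A] (illustrative)
DT = 1.0                  # time discretization [s], as in the paper

def battery_step(soc_t, p_bsg_t):
    """One SoC update: current from the 0th-order circuit (eq. 4), then Coulomb counting (eq. 5)."""
    voc, r0 = v_oc(soc_t), r_0(soc_t)
    i_t = (voc - np.sqrt(voc**2 - 4.0 * r0 * p_bsg_t)) / (2.0 * r0)   # eq. (4)
    soc_next = soc_t - DT / C_NOM * (i_t + I_A)                       # eq. (5)
    return i_t, soc_next

print(battery_step(0.6, 2000.0))
```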
4) Torque Converter Model: A simplified torque converter model is developed to compute the losses during traction and regeneration modes. Here, the lock-up clutch is assumed to be always actuated, applying a controlled slip $\omega_{slip}$ between the turbine and the pump. The assumption might be inaccurate during launches, and this can be compensated by including a fuel consumption penalty in the optimization problem, associated with each vehicle launch event. This model is described as follows [25]:

$$T_{tc,t} = T_{pt,t}, \qquad (6)$$

$$\omega_{p,t} = \omega_{tc,t} + \omega_{slip}\left(n_{g,t}, \omega_{eng,t}, T_{eng,t}\right), \qquad (7)$$

$$\omega_{eng,t} = \begin{cases} \omega_{p,t}, & \omega_{p,t} \geq \omega_{stall} \\ \omega_{idle,t}, & 0 \leq \omega_{p,t} < \omega_{stall} \\ 0, & 0 \leq \omega_{p,t} < \omega_{stall} \text{ and } Stop = 1 \end{cases} \qquad (8)$$

where $n_g$ is the gear number, $\omega_{p,t}$ is the speed of the torque converter pump, $\omega_{tc,t}$ is the speed of the turbine, $\omega_{stall}$ is the speed at which the engine stalls, $\omega_{idle,t}$ is the idle speed of the engine, $Stop$ is a flag from the ECU indicating engine shut-off when the vehicle is stationary, $T_{tc,t}$ is the turbine torque, and $T_{pt,t}$ is the combined powertrain torque. The desired slip $\omega_{slip}$ is determined based on the powertrain conditions and the desired operating mode of the engine (traction or deceleration fuel cut-off).
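The logic of (6)-(8) reduces to passing the powertrain torque through the locked clutch and selecting the engine speed among the pump speed, the idle speed, and zero, depending on the stall threshold and the stop flag. A sketch is given below, with the slip calibration map left as a placeholder argument and the slip evaluated from the previous engine operating point as a simplification.

```python
def torque_converter(T_pt_t, omega_tc_t, n_g_t, omega_eng_prev, T_eng_prev,
                     omega_stall, omega_idle, stop_flag, slip_map):
    """Torque converter with always-actuated lock-up clutch, eqs. (6)-(8)."""
    T_tc_t = T_pt_t                                            # eq. (6): turbine torque
    omega_slip = slip_map(n_g_t, omega_eng_prev, T_eng_prev)   # controlled slip (calibration map)
    omega_p_t = omega_tc_t + omega_slip                        # eq. (7): pump speed
    # eq. (8): engine speed selection
    if omega_p_t >= omega_stall:
        omega_eng_t = omega_p_t
    elif stop_flag:          # engine shut off while the vehicle is stationary
        omega_eng_t = 0.0
    else:                    # below the stall speed but the engine is kept running
        omega_eng_t = omega_idle
    return T_tc_t, omega_p_t, omega_eng_t

# Example with a trivial constant-slip placeholder map:
print(torque_converter(80.0, 150.0, 3, 160.0, 60.0,
                       omega_stall=90.0, omega_idle=80.0, stop_flag=False,
                       slip_map=lambda ng, we, te: 5.0))
```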
5) Transmission Model: The transmission model is based on a static gearbox, whose equations are as follows:

$$\omega_{tc,t} = \tau_g(n_{g,t})\,\omega_{trans,t} = \tau_g(n_{g,t})\,\tau_{fdr}\,\omega_{out,t} = \tau_g(n_{g,t})\,\tau_{fdr}\,\frac{v_{veh,t}}{R_w}, \qquad (9)$$

$$T_{trans,t} = \tau_g(n_{g,t})\,T_{tc,t}, \qquad (10)$$

$$T_{out,t} = \begin{cases} \tau_{fdr}\,\eta_{trans}(n_{g,t}, T_{trans,t}, \omega_{trans,t})\,T_{trans,t}, & T_{trans,t} \geq 0 \\[4pt] \dfrac{\tau_{fdr}}{\eta_{trans}(n_{g,t}, T_{trans,t}, \omega_{trans,t})}\,T_{trans,t}, & T_{trans,t} < 0 \end{cases} \qquad (11)$$

where $\tau_g$ and $\tau_{fdr}$ are the gear ratio and the final drive ratio, respectively. The transmission efficiency $\eta_{trans}(n_g, T_{trans}, \omega_{trans})$ is scheduled as a nonlinear map expressed as a function of the gear number $n_g$, the transmission input shaft torque $T_{trans,t}$ and the transmission input speed $\omega_{trans}$. $\omega_{out,t}$ refers to the angular velocity of the wheels. $R_w$ and $v_{veh,t}$ are the radius of the vehicle wheel and the longitudinal velocity of the vehicle, respectively.

6) Vehicle Longitudinal Dynamics Model: The vehicle dynamics model is based on the road-load equation, which accounts for the tire rolling resistance, road grade, and aerodynamic drag:

$$a_{veh,t} = \frac{T_{out,t} - T_{brk,t}}{M R_w} - \frac{1}{2}\frac{C_d \rho_a A_f}{M} v_{veh,t}^2 - g C_r \cos\alpha_t\, v_{veh,t} - g \sin\alpha_t. \qquad (12)$$

Here, $a_{veh,t}$ is the longitudinal acceleration of the vehicle, $T_{brk}$ is the brake torque applied on the wheel, $M$ is the mass of the vehicle, $C_d$ is the aerodynamic drag coefficient, $\rho_a$ is the air density, $A_f$ is the effective aerodynamic frontal area, $C_r$ is the rolling resistance coefficient, and $\alpha_t$ is the road grade.
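Equation (12) can be evaluated directly once the wheel torque is known. The sketch below uses illustrative vehicle parameters (the calibrated values are not reproduced here) and keeps the rolling-resistance term exactly as written in (12).

```python
import numpy as np

# Illustrative vehicle parameters (not the calibrated values from the paper)
M, R_W = 1600.0, 0.33           # vehicle mass [kg], wheel radius [m]
CD, RHO_A, AF = 0.30, 1.2, 2.2  # drag coefficient, air density [kg/m^3], frontal area [m^2]
CR, G = 0.009, 9.81             # rolling-resistance coefficient, gravity [m/s^2]

def road_load_accel(T_out_t, T_brk_t, v_veh_t, alpha_t):
    """Longitudinal acceleration from the road-load equation, eq. (12)."""
    traction = (T_out_t - T_brk_t) / (M * R_W)
    aero = 0.5 * CD * RHO_A * AF / M * v_veh_t**2
    rolling = G * CR * np.cos(alpha_t) * v_veh_t   # velocity-dependent, as written in eq. (12)
    grade = G * np.sin(alpha_t)
    return traction - aero - rolling - grade

print(road_load_accel(T_out_t=900.0, T_brk_t=0.0, v_veh_t=15.0, alpha_t=0.01))
```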
7) Vehicle Model Verification: The forward model is calibrated and verified using experimental data from chassis dynamometer testing. The key variables used for evaluating the model are the vehicle velocity, battery SoC, gear number, engine speed, desired engine and BSG torque profiles, and fuel consumption. Fig. 3 shows sample results from the model verification over the Federal Test Procedure (FTP) regulatory drive cycle, where the battery SoC and fuel consumption are compared against experimental data.

Fig. 3. Validation of Vehicle Velocity, SoC and Fuel Consumed over FTP Cycle.

The mismatches in the battery SoC profiles can be attributed to the simplicity of the battery model, in which the electrical accessory loads are modeled using a constant current bias. The fuel consumption over the FTP cycle is well estimated by the model, with an error on the final value of less than 4% relative to the experimental data.

B. Traffic Model

A large-scale microscopic traffic simulator is developed in the open-source software Simulation of Urban Mobility (SUMO) [26] as part of the environment. To recreate realistic mixed urban and highway trips for training, the map of the city of Columbus, OH, USA is downloaded from the online database OpenStreetMap [27]. The map contains the length, shape, type and speed limit of the road segments and the detailed program of each traffic light at signalized intersections.

Fig. 4. Map of Columbus, OH for DRL Training [Each red and blue marker denotes the start and end point of an individual trip, and the colored lines denote the routes between these points.].

Fig. 4 highlights the area (approximately 11 km by 11 km) covered in the study. In this area, 10,000 random passenger car trips are generated as the training set, and the total distance of each trip is randomly distributed from 5 km to 10 km. Another 100 trips, whose origins and destinations are marked in red and blue in Fig. 4, respectively, are generated following the same distribution as the testing set. In addition, the inter-departure time of each trip follows a geometric distribution with success rate p = 0.01. The variation and randomness of the trips used for training enhance the richness of the environment, which subsequently leads to a learned policy that is less subject to local minima and agnostic to specific driving conditions (better generalizability) [28].
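The trip statistics described above are straightforward to reproduce. The sketch below samples the trip lengths and departure times (uniform 5-10 km lengths, geometric inter-departure times with p = 0.01), while the actual origin/destination and route sampling on the Columbus network would be delegated to SUMO tooling such as randomTrips.py.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N_TRIPS = 10_000

# Trip lengths uniformly distributed between 5 km and 10 km (training set)
trip_lengths_km = rng.uniform(5.0, 10.0, size=N_TRIPS)

# Inter-departure times [s] follow a geometric distribution with success rate p = 0.01
inter_departures = rng.geometric(p=0.01, size=N_TRIPS)
departure_times = np.cumsum(inter_departures)

print(trip_lengths_km[:3], departure_times[:3])
```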
The interface between the traffic simulator and the VD&PT model is established via the Traffic Control Interface (TraCI), which is part of the SUMO package. At any given time step, the kinetics of the vehicle calculated from the VD&PT model is fed to the traffic simulator as input. Subsequently, SUMO determines the location of the ego vehicle, updates the connectivity information, such as the SPaT of the upcoming traffic light and the GPS signal, and returns them to the agent as part of the observations.
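A minimal sketch of one such co-simulation step through TraCI is shown below. The vehicle ID, the configuration file, and the way the VD&PT speed is imposed (here via vehicle.setSpeed) are assumptions for illustration; the TraCI calls themselves (simulationStep, vehicle.getPosition, vehicle.getNextTLS, trafficlight.getNextSwitch) are standard.

```python
import traci

EGO = "ego"  # illustrative vehicle id

def cosim_step(v_cmd):
    """Push the VD&PT-computed speed to SUMO, advance one step, and read back observations."""
    traci.vehicle.setSpeed(EGO, v_cmd)        # kinetics from the VD&PT model fed to SUMO
    traci.simulationStep()                    # advance the microscopic simulation by one step

    x, y = traci.vehicle.getPosition(EGO)     # GPS-like position of the ego vehicle
    next_tls = traci.vehicle.getNextTLS(EGO)  # [(tls_id, link_index, distance, state), ...]
    obs = {"position": (x, y)}
    if next_tls:
        tls_id, _, dist, state = next_tls[0]
        obs.update({
            "tl_distance": dist,                                          # distance to upcoming light [m]
            "tl_state": state,                                            # current phase, e.g. 'G', 'y', 'r'
            "tl_next_switch": traci.trafficlight.getNextSwitch(tls_id),   # SPaT: next switch time [s]
        })
    return obs

# Typical usage (network/route files are placeholders):
# traci.start(["sumo", "-c", "columbus.sumocfg"])
# obs = cosim_step(v_cmd=12.0)
```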
In the policy-gradient formulation, $\rho(s)$ is the discounted on-policy state distribution, defined as follows:

$$\rho(s) = \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s). \qquad (18)$$
A. Baseline Controller
The baseline controller consists of the Enhanced Driver Model (EDM), a deterministic reference velocity predictor that utilizes route features to generate velocity profiles representing varying driving styles [48], [49], and a rule-based HEV energy management controller. The baseline strategy passes signalized intersections based on a line-of-sight (LoS) rule, where the LoS is a dynamic, human-vision-based distance parameter used to preview the upcoming route feature, derived from the Intersection Sight Distance (ISD) specified by the American Association of State Highway and Transportation Officials (AASHTO) and the U.S. DOT FHA [50].
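As a rough illustration of this line-of-sight rule (the actual EDM and ISD-based calibration in [48], [49], [50] are considerably richer), the baseline only reacts to a signalized intersection once it is within the LoS distance, then proceeds on green and targets a stop otherwise. A toy sketch:

```python
def baseline_tl_action(dist_to_tl_m, tl_is_green, line_of_sight_m, v_cruise):
    """Toy line-of-sight preview rule for the baseline driver; the real EDM/ISD logic is richer."""
    if dist_to_tl_m > line_of_sight_m:
        return v_cruise            # intersection not yet previewed: keep cruising
    if tl_is_green:
        return v_cruise            # previewed and green: pass the intersection
    return 0.0                     # previewed and not green: target a stop at the stop bar
```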
TABLE II
FUEL ECONOMY, AVERAGE SPEED AND SOC VARIANCE FOR BASELINE, ADP, DRL AND WAIT-AND-SEE SOLUTIONS

Fig. 9. Variation of Average Speed and Fuel Economy against Traffic Light (TL) Density for Baseline, DRL and Wait-and-see Solutions.
Fig. 11. Comparison of Velocity, ΔSoC, Time-Space and Fuel Consumption for Baseline, DRL and Wait-and-see Solutions [Mixed Route].

controller utilizes more of the battery capacity, i.e., a SoC profile with higher variation, compared to the baseline. The DRL controller's efficient maneuvers while approaching the intersections, coupled with better utilization of the SoC, result in up to 27% reduction in fuel consumption for both urban and mixed driving when compared against the baseline.

VIII. CONCLUSION

In this study, the eco-driving problem for HEVs with the capability of autonomously passing signalized intersections is formulated as a POMDP. To accommodate the complexity and high computational requirement of solving this problem, a learn-offline, execute-online strategy is proposed. To facilitate the training, a simulation environment was created, consisting of a mild HEV powertrain model and a large-scale microscopic traffic simulator developed in SUMO. The DRL controller is trained via PPO with LSTMs as the function approximators. The performance of the DRL controller is benchmarked against a baseline strategy, a deterministic MPC strategy and a wait-and-see (optimal) solution. With the properly designed rewarding mechanism, the agent learned to obey the constraints in the optimal control problem formulation. Furthermore, the learned explicit policy reduces the average fuel consumption by 17.5% over 100 randomly generated trips in urban, mixed-urban and highway conditions when compared to the baseline strategy, while keeping the travel time comparable.

Future work will focus on three aspects. First, hard constraint satisfaction will be rigorously analyzed. Then, the design of a reward function specific to the individual driver (personalization) will be investigated. Finally, the presence of a

TABLE III
NUMERICAL VALUES OF THE CONSTANTS IN REWARD FUNCTION

APPENDIX A
REWARD FUNCTION DESIGN

In general, the design of the reward function is iterative. To obtain the desired behavior from the trained policy, the numerical constants in the reward function typically require tuning by humans. Here, the numerical constants are listed in Table III. Some key takeaways are listed below.
1) Normalize the scale of the reward function such that the numerical value is between [−1, 1].
2) The reward items $r_{vel}$, $r_{batt}$ and $r_{tfc}$ are associated with the constraints, and they should be at least one order of magnitude higher than $r_{obj}$.
3) Rewards from the environment should reflect incremental incentives/penalties. For example, rewards associated with traffic lights are sparse, meaning that the agent receives these rewards periodically and needs a sequence of actions to avoid the large penalty. Equation (30) ensures that the penalty for violating the traffic condition is proportional to how severe the violation is.
4) Penalties related to constraints should be orders of magnitude larger than those related to performance.
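A hedged sketch of how these guidelines could be assembled into a single scalar reward is given below. The term names follow the appendix ($r_{obj}$, $r_{vel}$, $r_{batt}$, $r_{tfc}$), but the functional forms and coefficient values are placeholders rather than the paper's (30) and Table III.

```python
def reward(d_fuel_g, d_t_s, v_viol_mps, soc_viol, red_light_run_depth_m,
           c_fuel=0.01, c_time=0.01, c_vel=0.5, c_batt=0.5, c_tfc=0.5):
    """Illustrative eco-driving reward assembled from the appendix guidelines.

    Term names follow Appendix A; the functional forms and coefficients here
    are placeholders, not the paper's equations.
    """
    # Objective term: penalize fuel and travel time, kept within [-1, 1] (takeaway 1)
    r_obj = -min(1.0, c_fuel * d_fuel_g + c_time * d_t_s)
    # Constraint terms: meant to dominate r_obj when violated (takeaways 2 and 4)
    r_vel = -c_vel * v_viol_mps          # speed above the limit [m/s]
    r_batt = -c_batt * soc_viol          # SoC outside its admissible window
    # Traffic-light penalty proportional to how far the violation goes (takeaway 3)
    r_tfc = -c_tfc * red_light_run_depth_m
    return max(-1.0, r_obj + r_vel + r_batt + r_tfc)
```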
APPENDIX B
HYPERPARAMETERS

Two sets of hyperparameters have been obtained in this study. The first set contains those in the reward function, and the second contains those for the PPO algorithm.

The values in the first set are included in Table III. In finding these values, the behavior of the agent was found to vary with $c_{time}$, as it governs the relative rewards between fuel consumption and travel time. For a fair comparison against the other methods, this parameter was tuned to match the average speed while comparing fuel consumption. As for the reward coefficients related to the constraint penalties, the final performance does not appear sensitive as long as the penalties are large enough to dominate the positive rewards gained by ignoring the constraints while remaining numerically stable.

The hyperparameters used for training are listed in Table IV. The final performance was not found to be sensitive to this set of hyperparameters. The robustness here is most likely due to the fact that PPO is arguably one of the most stable model-free DRL algorithms, and that the parallel simulation environment provides a large amount of results efficiently.
TABLE IV
HYPERPARAMETERS USED FOR TRAINING

REFERENCES

[1] A. Vahidi and A. Sciarretta, "Energy saving potentials of connected and automated vehicles," Transp. Res. Part C: Emerg. Technol., vol. 95, pp. 822–843, 2018.
[2] A. Sciarretta, G. De Nunzio, and L. L. Ojeda, "Optimal ecodriving control: Energy-efficient driving of road vehicles as an optimal control problem," IEEE Control Syst. Mag., vol. 35, no. 5, pp. 71–90, Oct. 2015.
[3] Q. Jin, G. Wu, K. Boriboonsomsin, and M. J. Barth, "Power-based optimal longitudinal control for a connected eco-driving system," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 10, pp. 2900–2910, Oct. 2016.
[4] E. Ozatay et al., "Cloud-based velocity profile optimization for everyday driving: A dynamic-programming-based solution," IEEE Trans. Intell. Transp. Syst., vol. 15, no. 6, pp. 2491–2505, Dec. 2014.
[5] J. Han, A. Vahidi, and A. Sciarretta, "Fundamentals of energy efficient driving for combustion engine and electric vehicles: An optimal control perspective," Automatica, vol. 103, pp. 558–572, 2019.
[6] C. Sun, J. Guanetti, F. Borrelli, and S. Moura, "Optimal eco-driving control of connected and autonomous vehicles through signalized intersections," IEEE Internet Things J., vol. 7, no. 5, pp. 3759–3773, May 2020.
[7] F. Mensing, R. Trigui, and E. Bideaux, "Vehicle trajectory optimization for hybrid vehicles taking into account battery state-of-charge," in Proc. IEEE Veh. Power Propulsion Conf., 2012, pp. 950–955.
[8] L. Guo, B. Gao, Y. Gao, and H. Chen, "Optimal energy management for HEVs in eco-driving applications using bi-level MPC," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 8, pp. 2153–2162, Aug. 2017.
[9] P. Olin et al., "Reducing fuel consumption by using information from connected and automated vehicle modules to optimize propulsion system control," SAE Tech. Paper, 2019.
[10] S. Bae, Y. Choi, Y. Kim, J. Guanetti, F. Borrelli, and S. Moura, "Real-time ecological velocity planning for plug-in hybrid vehicles with partial communication to traffic lights," in Proc. IEEE 58th Conf. Decis. Control, 2019, pp. 1279–1285.
[11] D. Maamria, K. Gillet, G. Colin, Y. Chamaillard, and C. Nouillant, "Computation of eco-driving cycles for hybrid electric vehicles: Comparative analysis," Control Eng. Pract., vol. 71, pp. 44–52, 2018.
[12] B. Asadi and A. Vahidi, "Predictive cruise control: Utilizing upcoming traffic signal information for improving fuel economy and reducing trip time," IEEE Trans. Control Syst. Technol., vol. 19, no. 3, pp. 707–714, May 2011.
[13] S. R. Deshpande, S. Gupta, A. Gupta, and M. Canova, "Real-time eco-driving control in electrified connected and autonomous vehicles using approximate dynamic programming," J. Dyn. Syst., Meas. Control, vol. 144, 2022, Art. no. 011111.
[14] S. R. Deshpande et al., "In-vehicle test results for advanced propulsion and vehicle system controls using connected and automated vehicle information," SAE Tech. Paper 2021-01-0430, 2021.
[15] G. Guo and Q. Wang, "Fuel-efficient en route speed planning and tracking control of truck platoons," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 8, pp. 3091–3103, Aug. 2019.
[16] G. Guo and D. Li, "PMP-based set-point optimization and sliding-mode control of vehicular platoons," IEEE Trans. Computat. Social Syst., vol. 5, no. 2, pp. 553–562, Jun. 2018.
[17] G. Guo, D. Yang, and R. Zhang, "Distributed trajectory optimization and platooning of vehicles to guarantee smooth traffic flow," IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 684–695, Jan. 2023.
[18] G. Guo, Z. Zhao, and R. Zhang, "Distributed trajectory optimization and fixed-time tracking control of a group of connected vehicles," IEEE Trans. Veh. Technol., vol. 72, no. 2, pp. 1478–1487, Feb. 2023.
[19] F. Mensing, E. Bideaux, R. Trigui, and H. Tattegrain, "Trajectory optimization for eco-driving taking into account traffic constraints," Transp. Res. Part D: Transport Environ., vol. 18, pp. 55–61, 2013.
[20] J. Shi, F. Qiao, Q. Li, L. Yu, and Y. Hu, "Application and evaluation of the reinforcement learning approach to eco-driving at intersections under infrastructure-to-vehicle communications," Transp. Res. Rec., vol. 2672, no. 25, pp. 89–98, 2018.
[21] G. Li and D. Görges, "Ecological adaptive cruise control for vehicles with step-gear transmission based on reinforcement learning," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 11, pp. 4895–4905, Nov. 2020.
[22] A. Pozzi, S. Bae, Y. Choi, F. Borrelli, D. M. Raimondo, and S. Moura, "Ecological velocity planning through signalized intersections: A deep reinforcement learning approach," in Proc. IEEE 59th Conf. Decis. Control, 2020, pp. 245–252.
[23] G. Mahler and A. Vahidi, "An optimal velocity-planning scheme for vehicle energy efficiency through probabilistic prediction of traffic-signal timing," IEEE Trans. Intell. Transp. Syst., vol. 15, no. 6, pp. 2516–2523, Dec. 2014.
[24] P. Rong and M. Pedram, "An analytical model for predicting the remaining battery capacity of lithium-ion batteries," IEEE Trans. Very Large Scale Integration Syst., vol. 14, no. 5, pp. 441–451, May 2006.
[25] M. Livshiz, M. Kao, and A. Will, "Validation and calibration process of powertrain model for engine torque control development," SAE Tech. Paper 2004-01-0902, 2004.
[26] P. A. Lopez et al., "Microscopic traffic simulation using SUMO," in Proc. IEEE 21st Int. Conf. Intell. Transp. Syst., 2018, pp. 2575–2582. [Online]. Available: https://elib.dlr.de/124092/
[27] OpenStreetMap Contributors, "Planet dump retrieved from https://planet.osm.org," 2017. [Online]. Available: https://www.openstreetmap.org
[28] N. Heess et al., "Emergence of locomotion behaviours in rich environments," 2017, arXiv:1707.02286.
[29] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, no. 5, pp. 834–846, Sep./Oct. 1983.
[30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[31] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, no. 3/4, pp. 229–256, 1992.
[32] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Trans. Autom. Control, vol. 46, no. 2, pp. 191–209, Feb. 2001.
[33] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
[34] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[35] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[36] N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, "Memory-based control with recurrent neural networks," 2015, arXiv:1512.04455.
[37] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber, "Solving deep memory POMDPs with recurrent policy gradients," in Proc. Int. Conf. Artif. Neural Netw., 2007, pp. 697–706.
[38] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[39] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[40] O. Vinyals et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[41] C. Berner et al., "Dota 2 with large scale deep reinforcement learning," 2019, arXiv:1912.06680.
[42] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[43] H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, "Learning values across many orders of magnitude," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4287–4295.
[44] Z. Zhu, Y. Liu, and M. Canova, "Energy management of hybrid electric vehicles via deep Q networks," in Proc. IEEE Amer. Control Conf., 2020, pp. 3077–3082.
[45] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," 2015, arXiv:1506.02438.
[46] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[47] R. J. Williams and J. Peng, "Function optimization using connectionist reinforcement learning algorithms," Connection Sci., vol. 3, no. 3, pp. 241–268, 1991.
[48] S. Gupta, S. R. Deshpande, P. Tulpule, M. Canova, and G. Rizzoni, "An enhanced driver model for evaluating fuel economy on real-world routes," IFAC-PapersOnLine, vol. 52, no. 5, pp. 574–579, 2019.
[49] S. Gupta et al., "Estimation of fuel economy on real-world routes for next-generation connected and automated hybrid powertrains," SAE Tech. Paper 2020-01-0593, 2020.
[50] AASHTO, A Policy on Geometric Design of Highways and Streets. Washington, DC, USA: Amer. Assoc. State Highway Transp. Officials, 2001.
[51] O. Sundstrom and L. Guzzella, "A generic dynamic programming MATLAB function," in Proc. IEEE Control Appl., Intell. Control, 2009, pp. 1625–1630.
[52] Z. Zhu, S. Gupta, N. Pivaro, S. R. Deshpande, and M. Canova, "A GPU implementation of a look-ahead optimal controller for eco-driving based on dynamic programming," in Proc. IEEE Eur. Control Conf., 2021, pp. 899–904.
[53] Ohio Supercomputer Center, "Ohio Supercomputer Center," 1987. [Online]. Available: http://osc.edu/ark:/19495/f5s1ph73
[54] T. Phan-Minh et al., "Driving in real life with inverse reinforcement learning," 2022, arXiv:2206.03004.
[55] M. Vitelli et al., "SafetyNet: Safe planning for real-world self-driving vehicles using machine-learned policies," in Proc. IEEE Int. Conf. Robot. Automat., 2022, pp. 897–904.

Shobhit Gupta (Member, IEEE) received the bachelor of technology degree in mechanical engineering from the Indian Institute of Technology Guwahati, Guwahati, India, in 2017, and the M.S. and Ph.D. degrees in mechanical engineering from The Ohio State University, Columbus, OH, USA, in 2019 and 2022, respectively. He is currently a Propulsion Controls Researcher with General Motors Global Research and Development, Detroit, MI, USA. His research interests include AI/ML and optimal control applicable to battery management systems, connected and autonomous vehicles, and driver behavior recognition for predictive control.

Abhishek Gupta (Member, IEEE) received the B.Tech. degree in aerospace engineering from IIT Bombay, Mumbai, India, in 2009, the M.S. degree in applied mathematics from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2012, and the M.S. and Ph.D. degrees in aerospace engineering from UIUC in 2014. He is currently an Associate Professor with the ECE Department, The Ohio State University, Columbus, OH, USA. His research interests include stochastic control theory, probability theory, and game theory with applications to transportation markets, electricity markets, and cybersecurity of control systems.