
Balancing a Cart Pole Using Reinforcement Learning in OpenAI Gym Environment


Shaili Mishra
Department of Computer Science
Jaypee Institute of Information Technology
https://orcid.org/0000-0002-2628-4879
[email protected]

Anuja Arora
Department of Computer Science
Jaypee Institute of Information Technology
https://orcid.org/0000-0001-2515-1300
[email protected]
Abstract— Reinforcement Learning (RL) is a subcategory of machine learning. The feature that distinguishes reinforcement learning from other machine learning approaches is the self-training of the agent from the information and feedback it obtains from the environment. Appropriate action selection guides the agent towards a better, near-optimal solution. The agent has no prior knowledge about the environment; it has to explore each aspect of the environment based on feedback. This principal advantage of RL algorithms suits complex optimal control problems, such as the cart pole (inverted pendulum) problem and robotics, where no prior information on the system dynamics is available. In this paper, the traditional mechanical cart pole system is controlled using Q-learning models, and Mean Squared Error (MSE) and Mean Absolute Error (MAE) are applied as evaluation measures within the OpenAI Gym environment.

Keywords— Reinforcement Learning, Cart Pole, OpenAI Gym, Q Learning

I. INTRODUCTION

Reinforcement Learning is a machine learning training method that assigns rewards for desired behavior and penalties for undesired behavior. The Reinforcement Learning agent interprets its environment, takes actions, and updates what it has learned about that environment.

The agent selects a particular action according to its interaction with a robust and dynamic environment. Such robust and dynamic controllers, for example PID controllers and fuzzy controllers, are widely used in real-world problems where frequent adjustments are required for efficient performance. When RL is applied to such a system, the agent has to interact with the unknown environment and try to achieve the maximum cumulative reward [2]. Traditional methods for these mechanical systems are built on physics-based concepts and mathematical formulations, but these procedures are executed by manually tuning control parameters, which introduces various issues and errors in the operation of mechanical systems [1, 2].

According to the learning process, methods, and applications, machine learning algorithms are divided into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. In supervised machine learning, the mapping between input data and output data is already available as labeled data; using these predefined labels, the machine trains itself and then predicts the output for future inputs.

Supervised learning algorithms are suitable for classification and regression problems, while in unsupervised learning the algorithm tries to discover patterns in the unlabeled data given as input [13]. Unsupervised learning is applied to clustering and association problems.

Reinforcement Learning (RL) operates through a feedback process in which the agent interacts with the environment and performs an action in a particular state so as to optimize the reward according to requirements. This loop is a trial-and-error procedure in which the agent decides which action in a specific state helps reach the target state of the system.

After an action is performed in the current state, the next state is generated by the environment and a reward is returned to the agent as feedback. By repeating this procedure in a loop, the agent learns the best action in each state for achieving the maximum cumulative reward [11].

Mechanical and underactuated systems are particularly suitable for reinforcement learning research due to their dynamic and complex nature. In Reinforcement Learning, the agent evaluates each aspect of a real-time system to ensure its optimal performance, so reinforcement learning algorithms such as Q-learning and deep Q-learning are very popular among researchers working on real-time systems. The proficiency of RL has propelled it into various domains and solved domain-specific challenges efficiently. RL applications are mostly found in robotics, game playing, self-driving cars, resource management, drug discovery, and financial trading. In robotics, RL trains the agent to learn complex tasks such as grasping objects and obstacle detection. Similarly, in gaming, RL agents have acquired expertise comparable to human experts. For autonomous driving, the RL agent makes real-time decisions based on traffic navigation scenarios and acts accordingly. Another application of RL is resource management, where the agent optimizes resources such as inventory, traffic navigation, and energy distribution. Similarly, RL has trained agents to identify potential drug candidates and to analyze market data and, based on those data, execute strategies for maximum returns.

Reinforcement Learning computes the optimal solution, that is, the maximum result in the minimum time, for complex and dynamic problem domains. The agent comes to understand the environment by performing the same procedure repeatedly and refining its knowledge about the environment.

In previous research, several studies addressed the traditional control problem of cart pole balancing. However, most studies relied on the theoretical aspects of physics to solve it, so reinforcement learning approaches have attracted researchers' attention for such physics-based problems. In this work, Q-learning is proposed with two reward functions, Mean Squared Error (MSE) and Mean Absolute Error (MAE). The fast and stable convergence of the cart pole balancing problem obtained by Q-learning with MSE and MAE as reward functions was evaluated and performed more efficiently than the traditional physics-based approach.



The following sections cover the research problem of cart pole balancing, an overview of Reinforcement Learning, OpenAI Gym and Q-learning, and the reward functions. The cart pole system is defined in detail in Section II, Section III covers RL concepts, OpenAI Gym and Q-learning are explained in Section IV, Section V defines the different reward functions, Section VI presents the experiments and results, and Section VII concludes the paper.

II. THE CART-POLE BALANCING PROBLEM

Using reinforcement learning algorithms, the RL agent is trained to balance a pole joined to a cart at a pivot point, where the cart moves horizontally on a surface. The cart pole system mainly has two components, a simple cart and a vertical bar. The pole is fixed to the cart at a pivot point, and the cart can move in the left or right direction, as shown in Fig. 1.

In the cart pole environment, the agent explores all the possible actions and their corresponding rewards and then updates its policy towards optimal reward achievement. The goal of this problem is to find a control policy for balancing the pole in the upward direction by applying a bidirectional force to the cart [12].

Fig. 1. Cart Pole Dynamics and Control Parameters (Adapted from [1][3])

In Fig. 1, the dynamic system of the classical cart pole is shown. The cart moves horizontally on a fixed frictionless surface due to the force F, and θ is the deviation of the pole from the pivot point [3]. The state of the cart pole system is defined by the four-dimensional vector {x, ẋ, θ, θ̇}, where x is the horizontal distance traveled by the cart and ẋ is the linear velocity of the cart. The cart pole system's mathematical formulation is defined in Equation 1.

(M + m)\ddot{x} + \epsilon\dot{x} + ml\ddot{\theta}\cos\theta - ml\dot{\theta}^{2}\sin\theta = F(t)    (1)

ml\ddot{x}\cos\theta + \tfrac{4}{3}ml^{2}\ddot{\theta} - mgl\sin\theta = 0    (2)

In Equations 1 and 2, x(t) is the distance traveled by the cart on the frictionless surface from the centre point, and ẋ and ẍ represent the velocity and acceleration of the cart respectively. For the mathematical computation of the angular acceleration of the pole θ̈ and the linear acceleration of the cart ẍ, the formulas defined in Equation 3 and Equation 4 were applied.

\ddot{\theta} = \dfrac{(M + m)g\sin\theta - \cos\theta\,[F + ml\dot{\theta}^{2}\sin\theta]}{\tfrac{4}{3}(M + m)l - ml\cos^{2}\theta}    (3)

\ddot{x} = \dfrac{F + ml[\dot{\theta}^{2}\sin\theta - \ddot{\theta}\cos\theta]}{M + m}    (4)

The cart pole mechanical system is highly dynamic in nature. The agent has to control the input parameters in such a manner that the pendulum is balanced around its center of mass above the moving cart. In the simulation of the cart pole, the action space is defined as {LEFT, RIGHT}, which means the cart can move horizontally in either the left or the right direction.

The state space formulation of the nonlinear dynamics of the cart pole mechanism is defined in Equation 5 [2].

\begin{bmatrix} \dot{x} \\ \ddot{x} \\ \dot{\theta} \\ \ddot{\theta} \end{bmatrix} =
\begin{bmatrix} \dot{x} \\ \dfrac{F + ml[\dot{\theta}^{2}\sin\theta - \ddot{\theta}\cos\theta]}{M + m} \\ \dot{\theta} \\ \dfrac{(M + m)g\sin\theta - \cos\theta\,[F + ml\dot{\theta}^{2}\sin\theta]}{\tfrac{4}{3}(M + m)l - ml\cos^{2}\theta} \end{bmatrix}    (5)
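To make Equations 3-5 concrete, the following Python sketch (not part of the original paper) integrates the cart pole state with a simple Euler step; the physical constants M, m, l, and g are illustrative assumed values, and the force F would come from the agent's action.

```python
import math

# Illustrative constants (assumed values, not specified in the paper):
# cart mass M, pole mass m, pole half-length l, gravity g, and time step.
M, m, l, g = 1.0, 0.1, 0.5, 9.8
DT = 0.02

def cart_pole_derivatives(state, force):
    """Compute the accelerations of Equations 3 and 4 for the state {x, x_dot, theta, theta_dot}."""
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Equation 3: angular acceleration of the pole
    theta_ddot = ((M + m) * g * sin_t - cos_t * (force + m * l * theta_dot ** 2 * sin_t)) / (
        (4.0 / 3.0) * (M + m) * l - m * l * cos_t ** 2)
    # Equation 4: linear acceleration of the cart
    x_ddot = (force + m * l * (theta_dot ** 2 * sin_t - theta_ddot * cos_t)) / (M + m)
    return x_dot, x_ddot, theta_dot, theta_ddot

def euler_step(state, force):
    """Advance the state vector of Equation 5 by one Euler integration step."""
    x, x_dot, theta, theta_dot = state
    dx, ddx, dth, ddth = cart_pole_derivatives(state, force)
    return (x + DT * dx, x_dot + DT * ddx, theta + DT * dth, theta_dot + DT * ddth)
```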



their current states and tries to learn the most effective preferences for the immediate rewards. The value of
actions to achieve their goal in the given environment. the discount factor lies between [0,1].
• Observations: The observation is information about the
environment on that particular time stamp. Observation
provides information about the current state of the agent IV. OPENAI GYM AND Q-LEARNING
on that particular time stamp, possible action space for The OpenAI gym is a standard application programming
that current state, and other environmental information. interface for solving reinforcement learning for environments
as classic control and toy text, Atari games, 2D and 3D robots.
OpenAI Gym provides the interface for several classical
control engineering environments. These interfaces test the
efficiency of reinforcement learning so that proposed
algorithms can be applied to mechanical systems such as
robots, medical fields, etc.
In this paper, For the Cart pole problem, OpenAI Gym is
used. In the environment, a pole is attached by a pivot point to
a frictionless cart. The pendulum is placed in the upward
direction and the cart moves left and right on the surface. In
Fig. 2. The reinforcement learning process (Adapted from [4]) the Cart pole, the agent trying to keep the pole upright.
Initially, the pendulum starts from an upward direction and the
The AI agent selects an action from the action spacemoves system aims to prevent the pole from falling after applying
toward a new state and receives a reward from the force on the cart. The action space of the crat pole is two
environment as feedback. After repeating these steps, the discrete values (0,1), 0 represents push the cart in the left
agent learns which action is best in a particular state to obtain direction and 1 means push the cart in the right direction
the maximum cumulative reward. As shown in Fig 2, in each according to Figure 3. After performing an action on state, the
iteration the agent receives current state s from the environment produces an observation state space which
environment, then after applying an action a the agent’s state consists cart’s position, cart’s velocity, the pole angle, and the
changes. After repeatedly performing this process, the agent angular velocity of the pole. The cart position lies between (-
learns from the obtained experience regarding state, action, 4.8, 4.8) and the termination condition is (-2.4, 2.4). The pole
and corresponding next state and reward. This knowledge angle observed between (±24°) and episodes terminates when
helps the agent to achieve the cumulative reward to achieve the pole lies outside (±12°) range. The +1 reward is assigned
the goal. The main goal of the Reinforcement learning for balancing the pole in the upward direction on the cart as
algorithm is to compute the optimal policy for the given long as possible.[8]
problem.[5, 7]
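The loop of Fig. 2 can be written as a short, environment-agnostic sketch (not from the paper); `env` here is any object exposing `reset()` and `step()` in the style popularized by OpenAI Gym, and `policy` is whatever mapping from states to actions the agent is currently learning.

```python
def run_episode(env, policy, max_steps=500):
    """One agent-environment interaction episode: state -> action -> reward -> next state."""
    state = env.reset()                                # agent receives the initial state s
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                         # agent applies action a in state s
        state, reward, done, info = env.step(action)   # environment returns s' and reward r
        total_reward += reward                         # feedback the agent uses to improve its policy
        if done:
            break
    return total_reward
```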

B. Markov Decision Process (MDP)

The Markov Decision Process (MDP) is the mathematical framework for dynamic decision-making situations in which the performance of the system is influenced by random factors and uncertain system parameters [6]. The MDP framework consists of the key terms state S, action A, transition probability P, reward R, and discount factor γ, so the MDP for the cart pole problem is the tuple ⟨S, A, P, R, γ⟩, where:

• State (S): The state space parameters in the cart pole problem represent the current status of the agent, which includes the cart position, the cart velocity, the pole angle, and the pole angular velocity.

• Action (A): The action set A contains all possible movements that control the dynamics of the cart and pole. In the cart pole environment, the cart can only move to the left or to the right.

• Transition probability (P): P is the probability distribution, given the current state, over the possible successor states.

• Reward (R): The reward R is a numerical value associated with a state-action pair that steers the agent's learning process towards the maximum cumulative sum of rewards.

• Discount factor (γ): The discount factor γ determines the influence of future rewards relative to the preference for immediate rewards. The value of the discount factor lies in [0, 1].
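As a small illustration (not from the paper), the cart pole MDP described above can be written down as a plain data structure; the numeric bounds are the ones quoted later in Section IV (±24° is roughly ±0.418 rad) and γ is taken from Table I.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CartPoleMDP:
    # S: four continuous state variables (cart position, cart velocity, pole angle, pole angular velocity)
    state_low: Tuple[float, float, float, float] = (-4.8, -float("inf"), -0.418, -float("inf"))
    state_high: Tuple[float, float, float, float] = (4.8, float("inf"), 0.418, float("inf"))
    # A: two discrete actions, push left (0) or push right (1)
    actions: List[int] = field(default_factory=lambda: [0, 1])
    # R: +1 for every step the pole stays upright
    step_reward: float = 1.0
    # gamma: discount factor (value used in Table I)
    gamma: float = 0.99
    # P: transition probabilities are implicit in the simulator dynamics of Equation 5
```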



IV. OPENAI GYM AND Q-LEARNING

OpenAI Gym is a standard application programming interface for reinforcement learning environments such as classic control and toy text, Atari games, and 2D and 3D robots. OpenAI Gym provides the interface for several classical control engineering environments. These interfaces test the efficiency of reinforcement learning so that the proposed algorithms can later be applied to mechanical systems such as robots, medical devices, etc.

In this paper, OpenAI Gym is used for the cart pole problem. In the environment, a pole is attached by a pivot point to a frictionless cart. The pendulum starts in the upright position, the cart moves left and right on the surface, and the agent tries to keep the pole upright; the goal is to prevent the pole from falling after force is applied to the cart. The action space of the cart pole consists of two discrete values (0, 1): 0 pushes the cart to the left and 1 pushes the cart to the right, as shown in Fig. 3. After an action is performed in a state, the environment produces an observation consisting of the cart's position, the cart's velocity, the pole angle, and the angular velocity of the pole. The cart position lies in (-4.8, 4.8) and the termination condition is (-2.4, 2.4). The pole angle is observed within ±24°, and an episode terminates when the pole angle lies outside the ±12° range. A reward of +1 is assigned for every step the pole is kept upright on the cart, so the agent is rewarded for balancing it as long as possible [8].

Fig. 3. Action state parameters for Cart pole mechanism [2]
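The environment just described can be instantiated directly; the snippet below is a usage sketch based on the classic Gym API from reference [8] (newer Gymnasium versions return `(observation, info)` from `reset()` and a five-element tuple from `step()`).

```python
import gym

env = gym.make("CartPole-v1")
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity

obs = env.reset()
for t in range(200):
    action = env.action_space.sample()           # random action, just to exercise the interface
    obs, reward, done, info = env.step(action)   # reward is +1 for every step the pole stays up
    if done:                                     # pole beyond the ±12° range or cart beyond ±2.4
        print(f"Episode finished after {t + 1} steps")
        break
env.close()
```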
Q-learning is a value-based reinforcement learning algorithm in which the environment is not familiar to the agent, and the agent has to figure out the best actions for obtaining an optimal solution. In the Q-learning method, the samples (S, A, R, S′) are generated by following a policy that maximizes the Q(S′, A′) values for achieving the desired target. For the formulation of the Q-value, the ε-greedy policy is applied to the samples (S, A, R, S′), as defined in Equation (6):

Q(S, A) = R(S, A) + \gamma \max_{A} Q(S', A)    (6)

where Q(S, A) is the Q-value at state S for action A; computing it requires the immediate reward R(S, A) and the maximum Q-value from the next state S′. Gamma (γ) is a discount factor that decides the importance of future rewards [7, 10]. The value of Q(S′, A) depends upon future Q-values, as defined in Equation (7):

Q(S, A) = \gamma Q(S', A) + \gamma^{2} Q(S'', A) + \cdots + \gamma^{n} Q(S''^{\ldots n}, A)    (7)

For computing the Q-value of action A_t at state S_t, the maximizing action \arg\max_{A'} Q(S', A') for state S′ is required, following the concept of exploitation. To update the Q-value, Equation (8) is defined:

Q(S_t, A_t) = Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_{A_{t+1}} Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]    (8)

In reinforcement learning, the agent has to choose whether to continue with its current knowledge about states, actions, and rewards or to explore other options. Exploration is a greedy approach in which the agent focuses on improving its knowledge about the environment for long-term benefit, while in exploitation the agent tries to obtain maximum rewards by exploiting its current knowledge rather than gathering more. So, in exploration the agent persistently gathers information to obtain optimal results, while in exploitation it optimizes its decisions based on the information currently available.
V. REWARDS

In reinforcement learning, the reward is the feedback, a numerical value, generated by the environment and received by the agent after taking an action in a particular state. The reward function helps the agent learn about the environment and update its knowledge of the system. The primary goal of the agent is to choose state-action pairs in a way that leads toward the maximum (or minimum) reward. In this paper, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are applied as the performance measures of Q-learning for the cart pole problem, guiding the RL agent towards an optimal decision-making policy.

• Mean Squared Error Loss (MSE): Mean Squared Error measures the average of the squared differences between the predicted and the actual values. The mathematical formulation for computing the mean squared error is defined in Equation (9):

MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^{2}    (9)

where y_i and \hat{y}_i represent the predicted value and the actual value of a particular sample and N is the total number of samples.

• Mean Absolute Error Loss (MAE): MAE evaluates the average absolute difference between the observed entities and the predicted entities. The formula is defined in Equation (10):

MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|    (10)

where N is the total number of samples and y_i and \hat{y}_i are the predicted value and the actual value of a particular sample.
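The paper does not state exactly which quantities play the roles of y_i and ŷ_i when MSE and MAE act as reward signals for the cart pole; one plausible reading, sketched below purely as an illustration, is to penalize the deviation of the observed state from an upright target state with a squared or an absolute error.

```python
import numpy as np

TARGET = np.zeros(4)  # assumed target: cart centred, pole upright, zero velocities

def mse_reward(observation):
    """Negative mean squared deviation from the target state (Equation 9 used as a penalty)."""
    err = np.asarray(observation, dtype=float) - TARGET
    return -float(np.mean(err ** 2))

def mae_reward(observation):
    """Negative mean absolute deviation from the target state (Equation 10 used as a penalty)."""
    err = np.asarray(observation, dtype=float) - TARGET
    return -float(np.mean(np.abs(err)))
```

Because the squared term grows quickly for large deviations, an MSE-style shaping weights outliers far more heavily than the MAE variant, which is consistent with the behaviour reported in Section VI.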
VI. EXPERIMENTS AND RESULTS

This section provides the details of the experiments and their outcomes after applying the reward functions described in the previous section.

A. Hyperparameters Setting

The results obtained with both reward functions, MSE and MAE, for the cart pole problem solved by the Q-learning approach are presented in this section. The training process for both reward functions was validated by varying the hyperparameters and adjusting them for better convergence. The hyperparameter details are stated in Table I.

TABLE I. Q-LEARNING PARAMETER DETAILS FOR THE CART POLE OPENAI GYM ENVIRONMENT

Parameter              Value
Gamma                  0.99
Episodes               100
Epsilon                0.99
Activation function    Tanh, linear
Learning rate          1e-2
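For completeness, the Table I settings can be collected into a single configuration object and passed into a training loop such as the one sketched in Section IV; the dictionary below simply restates the table, and the activation entry would only matter if Q(S, A) were approximated by a small neural network rather than a lookup table.

```python
# Hyperparameters from Table I.
CONFIG = {
    "gamma": 0.99,
    "episodes": 100,
    "epsilon": 0.99,
    "activation": ("tanh", "linear"),
    "learning_rate": 1e-2,
}
```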



B. Performance of Various Reward Functions in Q-Learning

To evaluate performance in the Cartpole environment, the two reward functions, Mean Squared Error and Mean Absolute Error, were implemented as reward functions in the Q-learning algorithm. In the training procedure, 100 episodes were generated, and from those samples the mean and median were computed for each reward function, as shown in Table II.

TABLE II. THE CART POLE'S REWARD FUNCTIONS

Reward Function    Mean        Max    Min    Median
MSE                26.61386    88     9      22
MAE                25.36634    85     8      20

Figure 4 and Figure 5 show the reward functions MSE and MAE for the Q-learning algorithm applied to the cart pole environment. A violin plot is a combination of a box plot and a probability density function. The white dot in the box in Figure 4 depicts the median for a specific reward function, and the distribution of the reward function is described by the violin shape. Box plots assume a uniform presentation, while the violin plot reveals the differing distributions. The violin plot in Figure 4 shows that Mean Squared Error gave weight to each outlier value while MAE ignored them. The shape of the violin plots shows that for both MSE and MAE the reward is distributed near the mean value [9].

Fig. 4. Q-learning reward plots (0 = MSE reward, 1 = MAE reward)

Figure 5(a): MSE Q-learning performance

Figure 5(b): MAE Q-learning performance

Figure 5(c): Comparative plot of MSE Q-learning and MAE Q-learning performance

In Figure 5(a), the performance of MSE is depicted. Initially, up to 20 episodes, the value of the reward function varies between 35 and 7, and the maximum reward for MSE is 88, reached at episode 72. Figure 5(b) shows the reward-versus-episode plot for the MAE reward function; this graph shows a maximum reward of 85, but MAE ignores the outlier values. Figure 5(c) is a comparative line plot of both reward functions, MSE and MAE. Table II lists the comparative details of both reward functions, including the mean, median, maximum, and minimum for MSE and MAE.
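The summary statistics in Table II and a violin plot like Figure 4 can be reproduced from the per-episode rewards; the sketch below assumes two hypothetical files, mse_returns.txt and mae_returns.txt, each holding the 100 episode rewards collected during training (these file names are not from the paper).

```python
import numpy as np
import matplotlib.pyplot as plt

mse_returns = np.loadtxt("mse_returns.txt")   # assumed location of the MSE run's episode rewards
mae_returns = np.loadtxt("mae_returns.txt")   # assumed location of the MAE run's episode rewards

for name, returns in [("MSE", mse_returns), ("MAE", mae_returns)]:
    print(name, "mean=%.5f max=%d min=%d median=%.1f" % (
        returns.mean(), returns.max(), returns.min(), np.median(returns)))

# Violin plot comparable to Figure 4: position 0 = MSE reward, 1 = MAE reward.
plt.violinplot([mse_returns, mae_returns], positions=[0, 1], showmedians=True)
plt.xticks([0, 1], ["MSE", "MAE"])
plt.ylabel("Episode reward")
plt.show()
```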
VII. CONCLUSION

Reinforcement Learning approaches provide a mathematical methodology for computing optimal solutions and determining the best decision-making strategies for agents. The agent is trained according to those strategies and seeks the best solution in a specific scenario. In this paper, Q-learning with Mean Squared Error (MSE) and Mean Absolute Error (MAE) as reward functions was applied to the cart pole system. The performance evaluation of both proposed approaches is based on balancing the pole with the maximum reward. The results show that Q-learning with MAE ignores the outlier values, while Q-learning with MSE gives importance to outlier values. In future work, more RL models can be applied to the cart pole problem and their performance compared.
REFERENCES
[1] Mishra, S., & Arora, A. (2023). A Huber reward function-driven deep reinforcement learning solution for cart-pole balancing problem. Neural Computing and Applications, 35(23), 16705-16722.
[2] Mishra, S., & Arora, A. (2022). Double Deep Q Network with Huber Reward Function for Cart-Pole Balancing Problem. International Journal of Performability Engineering, 18(9), 644.
[3] Kumar, S. (2020). Balancing a cartpole system with reinforcement learning – a tutorial. arXiv preprint arXiv:2006.04938.
[4] Sanghi, N. Deep Reinforcement Learning with Python.
[5] Samsuden, M. A., Diah, N. M., & Rahman, N. A. (2019, October). A review paper on implementing reinforcement learning technique in optimising games performance. In 2019 IEEE 9th International Conference on System Engineering and Technology (ICSET) (pp. 258-263). IEEE.
[6] Jia, J., & Wang, W. (2020, October). Review of reinforcement learning research. In 2020 35th Youth Academic Annual Conference of Chinese Association of Automation (YAC) (pp. 186-191). IEEE.
[7] Shi, Q., Lam, H. K., Xiao, B., & Tsai, S. H. (2018). Adaptive PID controller based on Q-learning algorithm. CAAI Transactions on Intelligence Technology, 3(4), 235-244.
[8] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[9] Ada, S. E., & Ugur, E. (2023). Meta-World Conditional Neural Processes. arXiv preprint arXiv:2302.10320.
[10] Nagendra, S., Podila, N., Ugarakhod, R., & George, K. (2017, September). Comparison of reinforcement learning algorithms applied to the cart-pole problem. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 26-32). IEEE.
[11] Ladosz, P., Weng, L., Kim, M., & Oh, H. (2022). Exploration in deep reinforcement learning: A survey. Information Fusion, 85, 1-22.
[12] Huang, X. (2022). Opponent cart-pole dynamics for reinforcement learning of competing agents. Acta Mechanica Sinica, 38(5), 521540.
[13] Mothanna, Y., & Hewahi, N. (2022, November). Review on Reinforcement Learning in CartPole Game. In 2022 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT) (pp. 344-349). IEEE.

