
Received 5 June 2022, accepted 25 June 2022, date of publication 29 June 2022, date of current version 8 July 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3187102

RL_QOptimizer: A Reinforcement Learning Based Query Optimizer

MOHAMED RAMADAN¹, AYMAN EL-KILANY¹, HODA M. O. MOKHTAR¹,², AND IBRAHIM SOBH³

¹Information Systems Department, Faculty of Computers and Artificial Intelligence, Cairo University, Cairo 12613, Egypt
²Faculty of Computing and Information Sciences, Egypt University of Informatics, Cairo 11865, Egypt
³Valeo, Cairo 12577, Egypt

Corresponding author: Mohamed Ramadan ([email protected])

ABSTRACT With the current availability of massive datasets and scalability requirements, different systems are required to provide their users with the best performance possible in terms of speed. On the physical level, performance can be translated into queries' execution time in database management systems. Queries have to execute efficiently (i.e., in minimum time) to meet users' needs, which puts an excessive burden on the database management system (DBMS). In this paper, we mainly focus on enhancing the query optimizer, one of the main components in a DBMS; it is responsible for choosing the optimal query execution plan and consequently determines the query execution time. Inspired by recent research on reinforcement learning in different domains, this paper proposes a Deep Reinforcement Learning Based Query Optimizer (RL_QOptimizer), a new approach to find the best policy for the join order in the query plan which depends solely on the reward system of reinforcement learning. The experimental results show a notable advantage of the proposed approach over the existing query optimization model of the PostgreSQL DBMS.

INDEX TERMS Join ordering problem, query execution plan and query optimization.

The associate editor coordinating the review of this manuscript and approving it for publication was R. K. Tripathy.

I. INTRODUCTION
In a DBMS, a single query can be executed through different execution plans. The query optimizer attempts to choose the most efficient way to execute a given query from the space of execution plans. Most DBMSs use the cost-based model for query optimization, where the optimizer estimates the cost of each execution plan and then selects the optimal plan that minimizes the cost among the set of candidate plans [1]. As the number of intermediate rows (results) is unknown at run time, the optimizer uses pre-calculated statistics, such as information about the distribution of data values and cardinality estimation, to estimate the cost of a plan rather than calculating the real cost of querying the data during the query plan. One of the main challenges in query optimization and query plan generation is the selection of the order in which to perform the join operations between tables (i.e., relations). Even if the final results of the query are the same regardless of the join order, the order in which the tables of a query are joined can have a dramatic effect on the query execution time. In addition, the number of possible join orders increases exponentially with the number of tables [2]. Hence, the query optimizer cannot compute the costs of all combinations to select the best join order during query execution. Consequently, most optimizers use heuristics, such as considering the shape of the query tree [3], to prune the search space. In this paper, we propose two versions of a Reinforcement Learning Based Query Optimizer (RL_QOptimizer) that identify the best execution plan based on the reward system of reinforcement learning. The first model uses reinforcement learning and the second uses deep reinforcement learning [4], [5].
The main contributions of this work are:
1) Proposing a new query optimizer model (RL_QOptimizer) for optimizing tables' join orders that is based on the Deep Reinforcement Learning technique. Deep Reinforcement Learning is used to find the optimal query execution plan.


2) Using a real environment to train the proposed model so that real feedback from the DBMS is obtained and consequently employed to discover the optimal execution plan.
3) Enhancing the overall execution plan time and proposing a new query optimizer that requires low or almost constant time to generate the execution plan for any query with any number of joins.
The rest of the paper is organized as follows: Section 2 presents background about the concepts used in this paper. Then, previous work related to the proposed models is discussed in Section 3. Section 4 details the proposed models and their different architectures. In Section 5, the results of the performance evaluation of the proposed models are presented. Finally, Section 6 concludes the paper.

FIGURE 1. Customer-ordering database ERD.

II. BACKGROUND
A. QUERY OPTIMIZATION
Query optimization is the process of choosing a suitable execution strategy for processing a query [1]. A traditional query optimizer uses stored statistics and probability rules to estimate the cardinalities of the different tables and consequently find the optimal query plan. These statistics include the number of records, the number of blocks, the number of distinct values in each column, and the selectivity of each attribute, which represents the average number of records satisfying an equality condition [1]. The goal of the query optimizer is to generate a query execution plan that minimizes the overall query execution time. A traditional query optimizer depends on heuristic and cost-based optimization. For example, applying the SELECT and PROJECT operations before the JOIN operations, or applying the most restrictive SELECT operations before other SELECT operations, are heuristic rules that mostly guarantee less execution time when applied to execution plans. In the cost-based optimization step, the optimizer estimates and compares the costs of query execution based on statistics and cardinality estimations using different execution strategies and algorithms. Then, it chooses the strategy with the lowest cost estimate [1]. The lowest cost estimate is usually found by performing the operations that initially reduce the size of intermediate results.
Example 1: Consider the following query on the Customer-Ordering database presented in Figure 1. The Customer-Ordering database has five entities, which are ''ORDER'', ''CUSTOMER'', ''PRODUCT'', ''CATEGORY'', and ''ADDRESS'', with cardinalities of 1000000, 100000, 10000, 1000, and 1000 rows respectively.

SELECT C.NAME, C.ADDRESS, P.PRICE, P.NAME
FROM CUSTOMER AS C
CROSS JOIN ORDER AS O
CROSS JOIN PRODUCT AS P
WHERE C.ID = O.CUSTOMER_ID
AND P.ID = O.PRODUCT_ID
AND C.PHONE_NUMBER = '0111'

Heuristic rules would recommend performing the query selection operator first in order to reduce query intermediate results [1], as shown in Figure 2. The following are the heuristic query optimization steps [1]:
1) Designing the initial tree of the query.
2) Moving the SELECT operation down the query tree.
3) Applying the more restrictive SELECT operation first.
4) Replacing CARTESIAN PRODUCT and SELECT with the JOIN operation.
5) Moving PROJECT operations down the query tree.
The SELECT operation (σ_c(R)) is used to select a subset of tuples from a relation that satisfies a condition specified in the selection, where 'c' is the selection condition, which is a Boolean condition. The selection operation is also known as horizontal partitioning because it partitions the table or relation horizontally. The PROJECT operation (π_A(R)) is used to select certain attributes while discarding others, where 'A' is the attribute list, which is the desired set of attributes from the attributes of relation R. The PROJECT operation is also known as vertical partitioning because it partitions the relation or table vertically, discarding the other columns or attributes. Finally, the JOIN operation (R1 ⋈ R2) is used to join two tables R1 and R2 based on the join condition. The outcome of joining two or more relations is the set of all possible tuple combinations that share the same common attribute.

B. JOIN ORDERING PROBLEM
A ''join'' operation is a relational operation that combines rows from two tables based on a related column. While a join works with only two tables at a time, a query that joins N tables is executed through N-1 joins. The optimizer needs to take a critical decision regarding the selection of the optimal join order, which greatly influences the execution time of a query. The process of choosing an efficient join order is difficult as the number of possible join combinations that the optimizer needs to explore and analyze increases exponentially with the number of tables [2]. In addition, the number of intermediate rows (results) is unknown at run time, which forces the optimizer to use pre-calculated statistics and


cardinality estimation to estimate the cost rather than the real cost.

FIGURE 2. A query execution tree generated by traditional heuristic rules.

Example 2: Consider a query on the customer-ordering database shown in Figure 1, where we want to get the customers along with each product they have ordered. We can write this simple query in PostgreSQL as follows:

SELECT * FROM CUSTOMER
CROSS JOIN ORDER
CROSS JOIN PRODUCT
WHERE CUSTOMER.ID = ORDER.CUSTOMER_ID
AND PRODUCT.ID = ORDER.PRODUCT_ID

If the optimizer chooses to join the ''CUSTOMER'' and ''PRODUCT'' tables first, it leads to a cross-product, as there is no relationship between the customer and product tables, which accordingly generates a very large set of intermediate results (100000 × 10000 = 10^9 rows) and consequently results in a high execution time for the query. But if the optimizer chooses to join the ''CUSTOMER'' and ''ORDER'' tables first, it leads to at most 10^6 intermediate rows. The query optimizer's role is to select the query join order that minimizes the query execution time. The better choice is affected by many factors including database indexes, tables' cardinality, data distributions, etc. This is called the join ordering problem, which has been studied by researchers for many years given the huge number of possible combinations, where each candidate plan has a different effect on the query execution time. The problem has an exponential complexity even when using a dynamic programming technique [6].

C. REINFORCEMENT LEARNING
Reinforcement learning is an important area of machine learning in which an agent learns to take actions that maximize the total reward in an environment [7]. Reinforcement learning algorithms learn by performing actions and receiving rewards or penalties from the environment. The main goal of the agent is to maximize its cumulative reward while interacting with the environment.
The main elements in reinforcement learning are agents, environments, states, actions, and a reward value [7]. Actions are the set of all possible actions that the agent can choose from. The environment takes the agent's action and the current state as input and returns a reward and the next state. The reward function defines the goal in a reinforcement learning problem; it maps a state (or state-action pair) of the environment to a reward value (negative or positive). On each time step, the environment sends the reinforcement learning agent a single value called the reward [7]. The policy is the strategy that an agent follows to determine the next action based on the current state. The value function V(s) is the expected long-term reward for an agent starting from state s under a specific policy; it measures how good it is to be in a given state. The action-value or Q-value function Q(s, a) measures how good it is to take an action in a given state; it is the expected return, or the overall reward, for taking action a in a specific state s.

FIGURE 3. Simplified architecture of the RL mechanism.

As shown in Figure 3, the proposed query optimizer (agent) interacts with the DBMS (environment) by selecting one of the join ordering conditions (actions) and receiving a negative execution time (penalty). Reinforcement learning problems are closely related to optimal control problems, particularly stochastic optimal control problems, which can be formulated as a Markov Decision Process (MDP) [7], [8]. A Markov process is a stochastic process in which the problem is a set of possible states and the future state depends only on the current state rather than the history (the Markov property). A Markov Decision Process is described by the tuple in (1):

< S, A, P(s, a), R(s, a) >    (1)

where S stands for the set of states, A describes the set of actions the agent can take, P(s, a) describes the probability distribution of being in a new state after taking action a in state s, and R(s, a) is the reward of taking action a in state s.
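To make the MDP formulation above concrete for the join ordering setting used in this paper, the following sketch casts states as 0/1 join-condition vectors, actions as indices of the remaining conditions, and the reward as the negative execution time obtained from the DBMS. It is a minimal illustration, not the authors' code: the class name and the injected execute_and_time callable are hypothetical.

```python
from typing import Callable, List, Tuple


class JoinOrderMDP:
    """Hypothetical sketch of the MDP <S, A, P, R> for join ordering.

    A state is a 0/1 vector over all join conditions of the query; an action
    picks one remaining condition; the reward is the negative execution time.
    """

    def __init__(self, join_conditions: List[str],
                 execute_and_time: Callable[[List[str]], float]):
        self.join_conditions = join_conditions      # all join conditions in the schema
        self.execute_and_time = execute_and_time    # assumed: ordered conditions -> seconds
        self.state = [0] * len(join_conditions)
        self.chosen: List[str] = []                 # join order built so far

    def reset(self, query_vector: List[int]) -> List[int]:
        # The initial state marks the join conditions that appear in the query.
        self.state = list(query_vector)
        self.chosen = []
        return list(self.state)

    def actions(self) -> List[int]:
        # Valid actions: indices of conditions not yet placed in the order.
        return [i for i, bit in enumerate(self.state) if bit == 1]

    def step(self, action: int) -> Tuple[List[int], float, bool]:
        # Append the chosen condition to the join order and clear its bit.
        self.chosen.append(self.join_conditions[action])
        self.state[action] = 0
        done = sum(self.state) == 0                 # terminal state: all-zero vector
        reward = 0.0
        if done:
            # Penalty = negative execution time of the fully ordered plan.
            reward = -self.execute_and_time(self.chosen)
        return list(self.state), reward, done
```

In this sketch the penalty is only issued once the whole join order is fixed, which is one possible reading of the per-plan reward described in the paper.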


D. DEEP REINFORCEMENT LEARNING
Neural networks are function approximators that can be used in reinforcement learning when the state space or action space is very large [4]. Deep reinforcement learning is the result of applying reinforcement learning using deep neural networks. Deep neural networks are used as the agents that learn to map state-action pairs to rewards. Depending on the result, the neural network is encouraged or discouraged to take that action on the same input in the future.
Deep reinforcement learning has been used widely in different domains, more specifically in domains where a reward or penalty can be given for any action of the agent. The Google DeepMind team developed many artificial intelligence models using deep reinforcement learning in different games, such as a model that plays Atari games and improves itself [4], which used a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. They also developed AlphaGo for the game of Go, a game that challenged artificial intelligence researchers for many years [7], by combining deep neural networks with reinforcement learning [9].
Deep reinforcement learning is also used for other complex problems like autonomous driving [10], [11]. As mentioned in Sallab et al. [10], it is difficult to treat autonomous driving as a supervised learning problem due to the strong interactions with the environment. Finally, the End-to-End Framework for Fast Learning Asynchronous Agents [12] proposed a training framework that combines the benefits of imitation learning (IL) and deep RL for fast learning asynchronous agents by extending the Asynchronous Advantage Actor-Critic (A3C) algorithm.

III. RELATED WORK
Related work is categorized into two main directions: traditional query optimization and learning-based query optimization. Previous work in each of those directions is discussed in the following subsections.

A. TRADITIONAL QUERY OPTIMIZATION TECHNIQUES
Most query optimizers rely upon the dynamic programming approach of System R [13], [14], which was the first implementation of SQL and pioneered several optimization techniques, including the utilization of dynamic programming for bottom-up join tree construction. They use the traditional cost model to determine the best plan for a given query by generating different strategies using cardinality estimations [1], [15]. These estimates rely upon statistics on the database and assumptions that may or may not be true. Invalid assumptions or inaccurate calculations for the cardinality estimation lead to poor execution plans [1], [15], [16].
Another type of query optimizer uses parametric query optimization [17]–[21], where the traditional optimizers make assumptions about many parameters [20] whose values are unknown at compile-time, the time before the actual execution. Parametric techniques attempt to identify several execution plans where each plan is optimal for a subset of possible values of the run-time parameters [6]. The goal is to identify candidate plans at compile-time, each optimal for some region of the parameter space, and the optimal plan is selected once the actual parameter values become known at run time. This type has many drawbacks, such as the overhead of pruning all plans for the entire relational selectivity space of one query, which is not a cost-effective approach [21]. It also depends on assumptions that may not always hold, such as assuming that a plan is optimal for all values in a specific region [22].
Most DBMSs use histogram-based techniques as part of their cost model to summarize the data of tables and perform efficient selectivity estimations [23], [24]. A large number of algorithms have been proposed for constructing histograms over a single attribute and over multiple attributes. A new algorithm to build a histogram is introduced in [25], which constructs it by minimizing the aggregated error. As the algorithm needs a huge construction time, generating an efficient execution plan for a given query from the space of possible execution plans is very expensive. In addition, the join selectivity estimation problem may lead to poor execution plans.

B. LEARNING-BASED QUERY OPTIMIZATION
Query optimization using learning models has become one of the hot topics in database research [26]–[35]. Some researchers have investigated the feasibility of applying machine learning techniques in query optimization to improve the query optimization process.
Some of the prior work used supervised learning to learn from old execution plans that were generated by the query optimizer for past queries to help in generating execution plans for new queries [1]. The authors in [26] proposed an execution plan recommendation system based on similarity identification between SQL queries. They used machine learning techniques to improve query similarity detection and hence were able to identify and associate similar queries having similar execution plans. This algorithm assumes that similar textual queries have similar execution plans; however, this is not always true in the real world, where similar textual queries can have different optimal execution plans. In addition, the paper didn't use the query optimizer's feedback to enhance the query execution plan.
Other proposed machine learning-based query optimizer models focus on automatically adjusting incorrect statistics and cardinality estimates of a query execution plan by learning from the query optimizer's past mistakes. One of the first approaches that focused on adjusting this information is [27], which compares the optimizer's estimates with the actual cardinalities during run time and computes the errors. Then, the model is adjusted to perform better in future runs. Also, Adaptive Cardinality Estimation [32] proposes a cardinality estimation approach that is integrated with the use of machine learning techniques. The main contribution of this approach


is using query execution statistics of previously executed queries to improve cardinality estimations. These proposed approaches have many issues, e.g., they are designed for static queries. In addition, they focus on cardinality estimation, so they still require the traditional cost model and heuristic rules that may lead to poor plans. Similar to the traditional optimizers, the proposed model's planning time increases as the number of join conditions increases. Another approach was proposed by [28], which uses machine learning algorithms for cardinality estimation to learn selectivities, taking a bounded range on each column as input. This method focused on the selectivity estimation of several range clauses but did not consider queries with joins. In addition, it focused on cardinality estimation, not the actual execution time, which may affect the execution time dramatically.
The approaches presented in [29], [30] use a deep reinforcement learning technique to determine the execution plan. The ReJOIN model [29] focuses on the join order selection problem by applying deep reinforcement learning techniques. In this model, the agent learns to maximize the reward through continuous feedback with the help of an artificial neural network. ReJOIN used the traditional cost model based on cardinality estimation during the learning phase rather than the actual execution time, which may lead to non-optimal plans. Learning State Representations for Query Optimization, discussed in [30], used deep neural networks to learn state representations of queries in order to learn the optimal plans. More specifically, the paper introduced two approaches: the first transforms a query into a feature vector and trains a deep neural network to take such vectors as input and output the estimated cardinality. The second, a recursive approach, trains the model to predict the cardinality of a query consisting of a single new operation applied to a subquery, in order to incrementally generate a representation of each subquery's intermediate results [30]. This paper explored the idea of training a deep reinforcement learning model to predict query cardinalities instead of relying entirely on basic statistics to estimate costs.
Neo (Neural Optimizer), presented in [31], uses a supervised learning model to guide a search algorithm through a large and complex space. Neo assumes the existence of a sample workload which consists of a set of queries considered representative of the total workload. In addition, the PostgreSQL optimizer is considered the expert that is responsible for generating the best query plans. Given the sample workload and their best query plans generated by the expert, the learnt model tries to generalize a model that can infer the plan with the least execution time for a query. In later stages, Neo retrains the supervised learning model based on the feedback received while running the model on its environment. Towards a Hands-Free Query Optimizer through Deep Learning, presented in [36], is another attempt that tries to identify potential complications for future research that uses deep reinforcement learning in query optimization problems. The authors also referred to the possibility of using latency as a reward function in future research directions.
The SkinnerDB system presented in [37] uses reinforcement learning for query optimization. The proposed model learns the optimal join order while running the query. The possible join orders are divided into slices, where each possible join order is tested on a slice of the data until the best join order is obtained and considered for the remaining slices of data. Query performance is evaluated using regret bounds as a reward system that considers the difference between the actual execution time and the time for an optimal join order. The Fully Observed Optimizer (FOOP) presented in [38] uses a reinforcement learning model where the reward function is defined as the cost model of the traditional DBMS optimizer. Another model that utilizes reinforcement learning is presented in [33], which is the closest model to the one proposed in this paper. The model suggests a learning-based technique for join ordering based on the plans generated by the DBMS optimizer to bootstrap the reinforcement learning model before fine-tuning it using real execution times. Bao (the Bandit optimizer), presented in [39], is a learned component that sits on top of an existing query optimizer in order to enhance query optimization rather than discarding the traditional query optimizer. The Bao component learns to map the query to the best execution strategy for the query. Then, upon receiving a query, the query optimizer generates multiple plans according to different strategies, where the learned model is expected to choose the best query plan given the possible strategies.
Another research direction explores the use of deep reinforcement learning to administer a DBMS. The case for Automatic Database Administration investigated in [40] proposes a new model of index selection to decide which attributes to create indexes on for a given workload based on deep reinforcement learning. UDO (the Universal Database Optimizer) [41] considers a variety of tuning choices, starting from picking transaction code variants over index selections up to database system parameter tuning. UDO uses reinforcement learning to converge to near-optimal configurations.
All of the earlier models have utilized the DBMS optimizer and its generated plans to train or at least bootstrap their learned models. Consequently, the purpose of this research is to develop reinforcement learning-based models that learn directly from the real query performance of different join orders, where the models are rewarded or penalized based on the actual execution time of different query plans. Furthermore, the proposed models explore the whole space of different query plans to learn the best join order for any given query.

IV. PROPOSED MODELS
In this paper, two versions of a Reinforcement Learning Based Query Optimizer (RL_QOptimizer) are proposed to solve the join ordering problem during query optimization. Both approaches employ the Q-learning model [5], which is one of the most popular reinforcement learning algorithms. The first approach is a simple RL Q-learning model which uses a simple lookup table (Q-table) to calculate the maximum expected future reward for each action at each state.


FIGURE 4. Model overview.

The second approach is a ''Deep'' Q-learning-based model, which is more suitable for large state and action spaces as it uses a neural network to approximate the Q-value function. Both models operate by applying a set of general steps, as shown in Figure 4.
The system has two main phases that are applied for both models: the first is the generation phase and the second is the selection phase.
In the generation phase, the model either generates all possible join ordering queries that may occur in the database schema to learn from, or generates all the join ordering queries from a given database workload. Generating all possible join ordering queries allows the model to be trained from scratch on every possible scenario. For example, if the database has joins between A, B, and C, the possible queries will be A ⋈ B, A ⋈ C, B ⋈ C, and A ⋈ B ⋈ C. On the other hand, if a database workload is available, the join ordering queries in the workload will be used in the training process. Following that, the system selects one of the join ordering queries in the selection phase. All possible execution plans of the selected query will be generated to train the model. For each possible execution plan, the agent interacts with the DBMS to get the actual execution time for this plan, which represents the reward in our models, multiplied by -1 to minimize the execution time. Both models are discussed in detail in the following subsections.

A. JOIN ORDERING USING REINFORCEMENT LEARNING
The first model uses a Q-table to store the expected reward for each state-action pair. The main function of the Q-table is to take a state and an action as input and produce the corresponding Q-value, as shown in Figure 5.

FIGURE 5. Q-learning.

The agent performs a sequence of actions to obtain the maximum total reward. The total reward is called the Q-value, which can be calculated by performing an action in a specific state to get the immediate reward and adding to it the highest Q-value possible from the next state, using formula (2) [5], [42]:

Q(s, a) = r(s, a) + γ max_a' Q(s', a')    (2)

The first part, r(s, a), is the immediate reward for taking action (a) in state (s). The second part is the discount factor (γ) multiplied by the estimate of the optimal future value, max_a' Q(s', a'), which is known as the discounted estimate of the optimal future value.
The model consists of four components: the input of the model, the states, the set of possible actions, and the reward function. The preceding equation shows how we compute the Q-value for an action (a) starting from a state (s). It is the sum of the immediate reward and the greedy action value from the next state (s'), i.e., choosing the action with the maximum Q-value over the other actions.
The input is typically represented as encoded query join conditions. The characteristics of join conditions are encoded in the form of a vector of size n, where n is the number of all possible join conditions in the database schema. Each cell of the vector can be 0 or 1, where 1 means that this condition is included in this query. For example, Input = [1, 1, 0, 0] means that this query includes the first and second join conditions.
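As a concrete illustration of this encoding, the sketch below builds the fixed-length 0/1 input vector from the join conditions present in a query. The condition strings and their ordering are illustrative assumptions, not values taken from the paper.

```python
# All possible join conditions in the schema define the vector positions (assumed order).
ALL_JOIN_CONDITIONS = [
    "CUSTOMER.ADDRESS_ID = ADDRESS.ID",   # position 0
    "ORDER.CUSTOMER_ID = CUSTOMER.ID",    # position 1
    "ORDER.PRODUCT_ID = PRODUCT.ID",      # position 2
    "PRODUCT.CATEGORY_ID = CATEGORY.ID",  # position 3
]


def encode_query(query_conditions):
    """Return the 0/1 state vector: 1 where the query uses that join condition."""
    return [1 if cond in query_conditions else 0 for cond in ALL_JOIN_CONDITIONS]


# Example: a query joining ORDER-CUSTOMER and ORDER-PRODUCT encodes as [0, 1, 1, 0].
print(encode_query({"ORDER.CUSTOMER_ID = CUSTOMER.ID",
                    "ORDER.PRODUCT_ID = PRODUCT.ID"}))
```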
The main goal in the join ordering problem is to find the best possible join order for a given query. In the proposed models, the states are represented by a 0/1 vector where a 1 refers to a join condition that will be applied to the query. In each state, the agent has a set of possible actions to select from in order to move from one state to another. These actions are all join conditions of the query represented by ones in the state vector. After selecting an action, the query builder adds this condition to its order and sets its value to zero in the state vector.
As no rules exist to correctly choose the reward function, the choice of the reward function is one of the most challenging tasks in any reinforcement learning model. In the proposed models, the goal is to optimize the total execution


time of queries; hence, the actual query execution time multiplied by -1 is used as the reward. During the experiments, the PostgreSQL [43] DBMS is used to get the actual execution time of the query. Obviously, the lower the execution time, the higher the reward.
In the learning stage, the system takes all possible join conditions for the given database and generates different possible queries to train on. The model then builds a vectorized representation of the query that is later used as input to the model. The agent selects one of the join ordering conditions in this vector, which is represented by a one; then the environment gives a reward by interacting with the PostgreSQL DBMS. This process is repeated until the terminal state. Finally, the function Q(s, a) in the Q-table is updated using equation (3) [5], [42]:

Q(s, a) = Q(s, a) + α[r(s, a) + γ max_a' Q(s', a') − Q(s, a)]    (3)

The first part, Q(s, a), is the current value in state (s) if action (a) is taken, and the second part is the learning rate (α) multiplied by the TD error, which is the difference between the TD target and the current Q(s, a). Training proceeds with the following three essential steps (a minimal sketch of this loop follows the list):
1) The agent begins in a state (s), takes an action (a), and observes the next state (s') and the reward r.
2) The agent chooses an action by referring to the Q-table entry with the greatest value for the next state (s').
3) The Q-values are updated.
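The update in equation (3) and the three steps above can be written compactly as follows. This is a minimal sketch assuming a JoinOrderMDP-style environment like the one outlined in Section II-C; the hyperparameter values are assumptions, and the epsilon-greedy policy is used here only for brevity (the paper's training instead enumerates every possible plan).

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9            # learning rate and discount factor (assumed values)
Q = defaultdict(float)             # Q[(state, action)] -> expected reward, default 0.0


def td_update(state, action, reward, next_state, next_actions):
    """One Q-table update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    s, ns = tuple(state), tuple(next_state)
    best_next = max((Q[(ns, a)] for a in next_actions), default=0.0)
    Q[(s, action)] += ALPHA * (reward + GAMMA * best_next - Q[(s, action)])


def train_episode(env, query_vector, epsilon=0.1):
    """Run one query through the environment, updating the Q-table at each step."""
    state = env.reset(query_vector)
    done = False
    while not done:
        actions = env.actions()
        # Epsilon-greedy exploration over the remaining join conditions (assumed policy).
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(tuple(state), a)])
        next_state, reward, done = env.step(action)
        td_update(state, action, reward, next_state, env.actions())
        state = next_state
```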
Example 3: Consider the customer-ordering database shown in Figure 1, which has 4 join conditions [CUSTOMER ⋈ ADDRESS, CUSTOMER ⋈ ORDER, ORDER ⋈ PRODUCT, PRODUCT ⋈ CATEGORY] and is vectorized as [1,1,1,1]. In the training phase, the agent tries to explore all possible execution plans. First, the agent explores each join condition individually; for instance, it trains on the vector [1, 0, 0, 0] and builds the first query with [CUSTOMER ⋈ ADDRESS]. The environment interacts with the DBMS to get the actual execution time of this query, in addition to the [ADDRESS ⋈ CUSTOMER] query execution time, in order to update the Q-table. The model performs the same process for each join condition individually. Then, the model trains on all possible pairs of join conditions. For example, it will train on the vector [1, 1, 0, 0] for [CUSTOMER ⋈ ADDRESS] and [CUSTOMER ⋈ ORDER]. In this case, the model trains on the better of the following two join orders: the first join order consists of [CUSTOMER ⋈ ADDRESS] followed by the better order of [CUSTOMER ⋈ ORDER] and [ORDER ⋈ CUSTOMER], while the second join order is [ADDRESS ⋈ CUSTOMER] followed by the better order of [CUSTOMER ⋈ ORDER] and [ORDER ⋈ CUSTOMER], where the better of the [CUSTOMER ⋈ ORDER] and [ORDER ⋈ CUSTOMER] joins has already been discovered during the previous cycle of training. This process is repeated until the training process is terminated by training on the whole [1, 1, 1, 1] vector.
During actual operation, when the model is required to generate the best execution plan for a query that is vectorized as [1, 1, 1, 0], the agent moves to the corresponding row in the Q-table to select the condition with the maximum reward. If the agent selects the third join condition, which is coded as a one in the vector, it is replaced by zero and the new state becomes [1, 1, 0, 0]. Then, the agent selects the best join condition given the chosen third join condition; this is also a one in the vector and will be replaced by a zero. This process is repeated recursively until the vector reaches the terminal state, which is [0, 0, 0, 0], to retrieve the best join condition order.

B. JOIN ORDERING USING DEEP REINFORCEMENT LEARNING
The join ordering using reinforcement learning model has many limitations that need to be solved before considering it as a practical solution. The main problem is related to the size of the database schema and the number of tables. A large database schema with many join conditions leads to a gigantic state space that may reach up to millions of states. Consequently, the Q-table needs a large amount of memory to store. In addition, the exploration of the Q-table won't be efficient. Another limitation is related to generalization, as the Q-table model can't infer the Q-value of a new state from the already-trained states. Thus, the join ordering using deep reinforcement learning model was introduced to address those limitations.
The join ordering using deep reinforcement learning model introduces a deep neural network to approximate the Q-value function. The state is given as input and the Q-values of all possible actions are generated as output, as shown in Figure 6. Similar to any deep neural network, it uses coefficients to approximate the function that maps an input to the output. Accordingly, the algorithm learns the right coefficients by adjusting their values iteratively in the learning stage. In the proposed model, the weights of the deep neural network are updated during training instead of updating the Q-value directly in the Q-table.
The proposed model uses the Deep Q-Network (DQN), which uses a neural network to approximate the Q-value function to tell the agent what action to take. This model was proposed in DeepMind's paper [4] to learn policies from high-dimensional sensory input using reinforcement learning. As stated in [42], RL is known to be unstable or even to diverge when neural networks are used to represent the action-values. Various factors lead to this instability: the presence of correlations in the sequence of observations and the correlations between the action-values (Q) and the target values. In the proposed model, we followed the improvements below from DeepMind's model presented in [4], [42] to tackle these issues:
1) Experience Replay: a replay buffer was used to store the latest N experience tuples observed by the agent, including state, action, reward (''response time''), and next state, which allows the network to reuse this data later by sampling from it randomly. During the training


phase, the model uses random training samples from the replay data as input, which leads to more efficient use of previous experiences and helps to reduce database calls by using a buffer rather than calling the engine for the same queries again.
2) Target Network: the Bellman equation provides us with the value of Q(s, a) via Q(s', a') as in equation (4):

Q(s, a) = Q(s, a) + [r(s, a) + γ max_a' Q(s', a') − Q(s, a)]    (4)

In deep Q-learning, we need to minimize the mean squared error between the target Q-value (the TD target) and the current output, which is called the TD error, and we need to estimate the TD target using equation (5) [5], [42]:

Q(s, a) = r(s, a) + γ max_a' Q(s', a')    (5)

When the parameters of our neural network are changed to bring Q(s, a) closer to the intended result, the value produced for Q(s', a') can change indirectly, which can make our training very unstable. So, the target network was introduced to stabilize the learning process [42].
In the proposed model, a separate network with fixed parameters was used to estimate the Q-targets. At every step, the parameters are copied from the DQN network to a separate target network to estimate the Q-targets. Similar to the first model, the deep reinforcement learning model takes the database schema and possible join orders to explore all possible queries in the training process. Assume the same customer-ordering database presented in Figure 1 is used to train the reinforcement learning model presented in the previous section. As presented in Example 3, the model is required to train on the database's 4 join conditions [CUSTOMER ⋈ ADDRESS, CUSTOMER ⋈ ORDER, ORDER ⋈ PRODUCT, PRODUCT ⋈ CATEGORY], which are encoded as the vector [1,1,1,1]. In this model, the same training process is applied but the neural network is employed as an approximation function and is used instead of the Q-table to predict the reward. The query vector represents the input of the neural network and the actual query execution time is the target value. The weights of the neural network are modified to minimize the error between the predicted value and the target value. This process is repeated until the training process is terminated.

FIGURE 6. DQN model.

A four-layer feed-forward neural network with 30 neurons in each hidden layer is used, with the number of input layer neurons equal to the number of possible joins in the database and the number of output layer neurons also equal to the number of possible joins. The input is fed as a vector of integers and the output is represented as a vector of integers. For example, when the model is required to find the best execution plan for a query with three join conditions in a database with five possible joins, the network is fed a vector with these conditions represented by ones and the others represented by zeros, such as [1, 1, 0, 1, 0]. The neural network output is expected to be a vector with the reward value for each condition, which guides the agent in choosing which join condition to perform. The agent will apply the condition with the maximum reward. The selected join condition, which is coded as one in the input vector, will be replaced by zero in the input vector, and the new input vector will be fed to the network in order to identify the next join condition. This process is repeated until the terminal state vector [0, 0, 0, 0, 0] is reached. The discount factor is set to 0.9 and an Adam optimizer [44] is used with a learning rate of 0.001.
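The network and training tricks described in this subsection can be sketched in Keras as follows. The hidden-layer width of 30, the discount factor of 0.9, and the Adam learning rate of 0.001 come from the paper; everything else (the reading of ''four-layer'' as input, two hidden layers and output, the replay-buffer size, batch size, and all function names) is an illustrative assumption rather than the authors' implementation.

```python
import random
from collections import deque

import numpy as np
from tensorflow import keras

N_JOINS = 6                      # number of possible joins in the schema (example value)
GAMMA = 0.9                      # discount factor (from the paper)


def build_q_network(n_joins: int) -> keras.Model:
    """Feed-forward network: n_joins inputs, two hidden layers of 30 units, n_joins outputs."""
    model = keras.Sequential([
        keras.layers.Dense(30, activation="relu", input_shape=(n_joins,)),
        keras.layers.Dense(30, activation="relu"),
        keras.layers.Dense(n_joins, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model


q_net = build_q_network(N_JOINS)
target_net = build_q_network(N_JOINS)
target_net.set_weights(q_net.get_weights())      # target network starts as a copy

replay = deque(maxlen=10_000)                     # experience replay buffer (assumed size)


def train_on_batch(batch_size: int = 32) -> None:
    """Sample past (s, a, r, s', done) tuples and fit the network toward the TD target."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    states = np.array([s for s, _, _, _, _ in batch], dtype=np.float32)
    next_states = np.array([ns for _, _, _, ns, _ in batch], dtype=np.float32)
    targets = q_net.predict(states, verbose=0)
    next_q = target_net.predict(next_states, verbose=0)   # TD target uses the frozen copy
    for i, (_, action, reward, _, done) in enumerate(batch):
        targets[i][action] = reward if done else reward + GAMMA * np.max(next_q[i])
    q_net.fit(states, targets, epochs=1, verbose=0)


def plan_join_order(query_vector):
    """Greedy rollout: repeatedly pick the remaining condition with the highest predicted Q."""
    state = np.array(query_vector, dtype=np.float32)
    order = []
    while state.sum() > 0:
        q_values = q_net.predict(state[None, :], verbose=0)[0]
        q_values[state == 0] = -np.inf       # mask conditions not in the query or already used
        best = int(np.argmax(q_values))
        order.append(best)
        state[best] = 0.0
    return order
```

The greedy rollout mirrors the inference procedure described above: the chosen condition's bit is cleared and the updated vector is fed back to the network until the all-zero terminal state is reached.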

V. EVALUATION
The objective of the evaluation is to assess the quality of the execution plans produced by the proposed models against the execution plans produced by the PostgreSQL DBMS as a baseline, in order to prove the proposed models' effectiveness. Towards this goal, the performance of the proposed models is evaluated and compared with the results of the PostgreSQL DBMS query optimizer. The models were evaluated on two databases: a real database that is used as a benchmark dataset for the join-ordering problem [45], and a synthetic database that is used in the TPC-H benchmark with various sizes to test different scaling factors [46].

A. EXPERIMENTS SETUP
All experiments are conducted using a laptop running Ubuntu version 18.04.3 LTS with an 8-core Intel Core i7-8550U and 8 GB of RAM. The memory available per operator (work_mem) was set to 512MB and the size of the buffer (shared_buffer) was set to 1 GB. Models were implemented using Python, with the TensorFlow [47] and Keras [48] libraries used to implement the neural networks. For the DQN model, the Adam optimizer [44] and the ReLU activation function were used. In addition, PostgreSQL (v10.13) was used and the join collapse limit parameter was set to 1 at run-time to force the planner to follow the join order; setting it to 1 prevents any reordering of explicit JOINs. Thus, the explicit join order


specified in the query will be the actual order in which the relations are joined.
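The sketch below shows one way a candidate plan's reward could be measured against PostgreSQL with the settings listed above (join_collapse_limit = 1 so the explicit join order is respected, work_mem = 512MB, and a 3-minute cutoff). The connection details and the use of EXPLAIN ANALYZE are illustrative assumptions, not the authors' published harness; a timed-out plan raises an exception that the training loop would need to catch and discard.

```python
import json

import psycopg2

# Connection parameters are placeholders for illustration.
conn = psycopg2.connect(dbname="imdb", user="postgres", password="postgres", host="localhost")
conn.autocommit = True


def execution_time_ms(explicit_join_sql: str, timeout_ms: int = 180_000) -> float:
    """Run one candidate plan and return its execution time in milliseconds.

    join_collapse_limit = 1 forces the planner to keep the explicit JOIN order,
    and statement_timeout cuts off plans slower than the 3-minute training limit.
    """
    with conn.cursor() as cur:
        cur.execute("SET join_collapse_limit = 1")
        cur.execute("SET work_mem = '512MB'")
        cur.execute(f"SET statement_timeout = {int(timeout_ms)}")
        cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + explicit_join_sql)
        row = cur.fetchone()[0]
        plan = row if isinstance(row, list) else json.loads(row)  # json column may be pre-parsed
        return plan[0]["Execution Time"]                          # milliseconds from PostgreSQL


# The reward used by the agent is the negative execution time (a penalty), e.g.:
# reward = -execution_time_ms("SELECT * FROM customer JOIN nation ON ...")
```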
FIGURE 7. Simplified ER diagram of the IMDB database.

FIGURE 8. Simplified ER diagram of the TPC-H database.

Databases:
1) IMDb (Internet Movie Database) [49] is an online real-world database containing a large amount of information about films, television programs, and home videos, which is available for non-commercial use and is used as a benchmark for the join ordering problem in article [45]. During the training process, only the tables shown in Figure 7 were used, where the average number of records per table is greater than 6 million records. According to figure 2 in the paper [45], a typical query graph has five relationships with the ''title'' table, which is the center table for the research workload. As a result, we chose all join operations related to the ''title'' table to expose the query optimizer's join ordering problem, which are as follows: (movie_companies ⋈ title, title ⋈ kind_type, movie_info ⋈ title, aka_title ⋈ title, movie_link ⋈ title, title ⋈ complete_cast).
2) TPC-H Database [46]: We use the retail database of the TPC-H benchmark, which contains a large amount of information about customers, orders, lineitems, parts, part suppliers, suppliers, nations, and regions. TPC-H continues to be the most widely used benchmark for relational systems, and most join queries operate on three tables or more [50]. lineitem and orders, which carry around 83 percent of the total data, are the most challenging and largest tables in this schema. As a result, all possible join operations on lineitem, orders and customer are generated, totaling 7 tables, to demonstrate the join ordering challenge, which are as follows: (customer ⋈ nation, lineitem ⋈ part, lineitem ⋈ supplier, lineitem ⋈ orders, nation ⋈ region, orders ⋈ customer). Our TPC-H experiments use a database scale factor of 1, which consists of the base row size (equal to 1GB of raw data). The simplified ER diagram for the entities without attributes is shown in Figure 8.
Because the model generates all potential execution plans for the generated query during the training process to learn a better plan than the traditional PostgreSQL model plan, some of the generated plans may be significantly worse than the PostgreSQL plan and take hours to complete. These plans are ineffective and inefficient execution plans from which the model shouldn't learn. To overcome this issue, we cut off and discard any query plan whose execution time is greater than a configurable maximum time, which is set to 3 minutes in our experiments, because using PostgreSQL's traditional model, all generated training queries in the experiments take less than 3 minutes.
In order to perform training and testing of the two proposed reinforcement learning models, all possible join ordering queries for each database schema were generated, where each query consists of a different number of joins. Each query was encoded in vector format as shown in Table 1 in order to use it during the training and testing phases, where the vector encodes join conditions between different relations.
The proposed models' performance is compared to the PostgreSQL optimizer using their generated plans' execution times. In addition, the planning time required to generate the plans by all models is collected during the experiments. For example, the proposed models generated a plan for the query mentioned in Table 1 that was executed in 60 seconds while the traditional optimizer plan was executed in 100 seconds. The optimizer decided to start by joining [MOVIE_INFO ⋈ TITLE], which took a larger execution time and led to very large intermediate results of around 29 million records. On the other hand, the proposed models' plan chose to start by joining [MOVIE_COMPANIES ⋈ TITLE], which led to less than 5 million intermediate results. In addition, the proposed models required around 3 milliseconds to generate their plans while the PostgreSQL optimizer required 25 milliseconds.
Another experiment was conducted to evaluate the model while using a database workload during the training phase instead of training the model on generated queries. This experiment uses the Join Order Benchmark (JOB), a collection of queries used as a benchmark for the join ordering problem in article [45]. Each query in the benchmark joins between four and


seventeen relations. Similar to the ReJOIN paper [29], the same 10 queries were utilized to test the suggested model. The last experiment was conducted to assess the generalization ability of the proposed models. The proposed models were trained on 80% of the queries while the remaining 20% were left for testing as unseen queries. The DQN model is compared against the Q-learning based model on the test set of unseen queries, where the execution times are collected for each query plan that is generated by each model.

TABLE 1. A query example and its encoded vector.

FIGURE 9. Comparison between the results of reinforcement learning models with PostgreSQL on IMDB database.

FIGURE 10. Comparison between the results of reinforcement learning models with PostgreSQL on IMDB using IQR.

FIGURE 11. Comparison between the results of reinforcement learning models with PostgreSQL on TPCH database.

B. EXPERIMENTAL RESULTS
1) IMDb Database Results
As shown in Figure 9, where the X-axis represents the query IDs and the Y-axis represents the query execution times in milliseconds, the proposed models outperform the PostgreSQL optimizer in 30% of the queries and fail in 3%; otherwise, both provide the same execution plans. For one query, the optimizer failed to finish within the maximum time of 3 minutes and the query execution was halted, whereas the reinforcement learning-based models finished before the maximum time. In Figure 10, the interquartile range (IQR) is used to measure the variability between the different models using all queries' performance on each model. The figure shows that the interquartile range (IQR), maximum execution time, and mean execution time (marked by an X in the graphs) of PostgreSQL are larger than those of the reinforcement learning-based proposed models.
2) TPC-H Database Experiment
Applying the proposed models on the TPC-H database shows that the proposed models outperform PostgreSQL in 27% of queries and fail in 8% with very small differences in execution time, as shown in Figure 11. During the evaluation, the optimizer failed to finish one query within the maximum allowed time, where the reinforcement learning-based models succeeded. As shown in Figure 12, the interquartile range (IQR), maximum execution time, and mean execution time (marked by an X in the graphs) of PostgreSQL are larger than those of the reinforcement learning-based proposed models.
Another important observation from Figure 9 and Figure 11 is that the Q-learning model results are very close to the DQN model results in all queries.
3) Join Order Benchmark Queries Results
The proposed model discovered execution plans that outperform PostgreSQL in 70% of the total queries in the benchmark and failed in 8% with small differences in execution time; otherwise, both provide the same execution plans. As shown in Figure 13 and Figure 14, the interquartile range (IQR) and mean execution time (shown by the X mark in the graphs) of PostgreSQL are greater than those of the DQN model. Using the proposed DQN model, the


average execution time of queries in the benchmark is 8670 ms, whereas the average execution time for PostgreSQL is 16479 ms. In addition, the DQN model outperforms PostgreSQL exceptionally on a subset of queries. For example, query ''17b'' in the benchmark requires 50000 ms when using the plan generated by the DQN model, unlike the plan generated by PostgreSQL, which requires 140000 ms. In comparison to the PostgreSQL optimizer, the DQN model generates query plans that are, on average, 27% less expensive for the 113 queries in the benchmark. In addition, the model was tested using the same ten queries that were used in ReJOIN [29], as shown in Figure 15, and demonstrated that the model provided join ordering plans that were 27% cheaper than those generated by the PostgreSQL optimizer, which is superior to the ReJOIN approach, which provides on average a 20% improvement. In addition, for query (16b), the plan generated by the DQN model is 60% less expensive than the PostgreSQL generated plan, whereas the ReJOIN generated plan is just 20% less expensive.

FIGURE 12. Comparison between the results of reinforcement learning models with PostgreSQL on TPCH database using IQR.

FIGURE 13. Comparison between the results of the DQN model with PostgreSQL on join order benchmark 113 queries using IQR without showing extreme outliers.

FIGURE 14. Comparison between the results of the DQN model with PostgreSQL on join order benchmark 113 queries using IQR with showing extreme outliers.

FIGURE 15. The percentage by which the DQN model outperformed PostgreSQL on the join order benchmark test queries.

FIGURE 16. Comparison between the planning time of reinforcement learning models with PostgreSQL.

4) Planning Time
In Figure 16, the relationship between the planning time and the number of join conditions is shown for both models, where the planning time increases as the number of join conditions increases in the PostgreSQL DBMS while the planning time for the Q-learning models is almost constant.
5) Generalization Assessment
Q-learning is built around the Q-table, where it can predict the best actions only for the states that are used during the model training phase and doesn't generalize


FIGURE 17. Comparison between the results of the DQN and the
Q-learning models on IMDB database for new queries.

FIGURE 18. Training curve showing average penalty per episodes during
the training process on IMDB database.

for queries that haven’t been seen. In the proposed


model, The Q-table has all possible join conditions in
the database schema, and for states that have never been reinforcement learning based model provides a great
seen before, the Q-table will reward all actions equally. flexibility to run the proposed models on real envi-
As a result, in the case of Q-table, the model will choose ronments rather than the restrictions presented in the
any action at random from the set of possible join Q-Table of the reinforcement learning model.
conditions. On the other hand, DQN depends on a deep 6) Training Overhead
neural network that transforms the state’s information In supervised learning, tracking the model perfor-
and their best actions into neurons learned weights. mance and adjusting it during training can be done
Thus, the DQN model is expected to be able to take using a validation set. On the other hand, tracking
an action for states that were never seen before given reinforcement-based model performance during train-
their similarity with the states used previously during ing can be a challenging task [4]. Reinforcement-
the training phase. To assess the DQN model gener- based model training is tracked using the average
alization ability, 20% of the generated queries were penalty applied during different training episodes.
chosen as test data to cover a variety of queries with Figure 18 shows the evolution of the cost function
varying numbers of joins operations, which included during the training process of the DQN model. The
four queries with two joins operations, two with three figure shows how the total cost decreases during train-
joins, four with four joins, and one with five joins. ing on the IMDB Database. During training, it was
The model was trained on the remaining queries on found that 1000 iterations would require 30 minutes
the IMDB Database. As shown in Figure 17, the DQN assuming that the actual response time of each query
model outperforms the Q-learning model in 55% of plan exists in the buffer replay. The network was able
queries, however, it loses in 18% of queries. Otherwise, to learn from previous experience by using the buffer
the two models provide the same execution times. The reply which was used to store the latest N experi-
experiment shows clearly the ability of the DQN model ence tuples observed by an agent where each expe-
to generalize to queries states that were never seen rience tuple includes state, action, reward (response
before unlike the Q-learning model that depends on time), and next state. Once the data is stored in the
randomization to generate plans for unseen queries. buffer reply, the network can utilize it when required
Although Q-learning is better than DQN in Queries 5, through learning without having to interact with the
6, there’s no guarantee to provide the same plans every database management system again. This feature has
time the same experiment is conducted as it’s based on a significant impact on the training time. For example,
random choices. storing 1000 states in buffer replay with their rewards
Generally, the results clearly show the effectiveness of the proposed models against the PostgreSQL DBMS optimizer: the execution plans produced by the proposed models needed less execution time and produced smaller intermediate results than the execution plans produced by PostgreSQL DBMS.
In addition, the deep reinforcement learning based model performed on par with the regular reinforcement learning model and was able to generalize to queries that were new and unseen during the training phase. The deep reinforcement learning based model also provides great flexibility to run the proposed models on real environments, free of the restrictions imposed by the Q-table of the regular reinforcement learning model.
6) Training Overhead
In supervised learning, tracking model performance and adjusting the model during training can be done using a validation set. On the other hand, tracking the performance of a reinforcement learning based model during training can be a challenging task [4]. Reinforcement-based model training is instead tracked using the average penalty applied during the different training episodes. Figure 18 shows the evolution of the cost function during the training process of the DQN model; the figure shows how the total cost decreases while training on the IMDB Database. During training, it was found that 1000 iterations require about 30 minutes, assuming that the actual response time of each query plan is already present in the replay buffer. The network learns from previous experience through this replay buffer, which stores the latest N experience tuples observed by the agent, where each tuple includes the state, action, reward (response time), and next state. Once the data is stored in the replay buffer, the network can reuse it whenever needed during learning without having to interact with the database management system again, which has a significant impact on the training time. For example, storing 1000 states in the replay buffer together with their rewards (response times) avoids calling the database management system again to retrieve the response time for each of those 1000 states. More specifically, if the average response time of the query plan in each of the 1000 buffered states is 30 seconds, the total time required to re-measure them is roughly 8 hours; preserving the state data in the replay buffer therefore saves roughly 8 hours of training time whenever the same response times are needed a second time during training.
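A minimal sketch of such a buffer is shown below; the class name, the fixed capacity, and caching rewards by a (state, action) key are illustrative assumptions rather than the exact structure used in RL_QOptimizer.

```python
import random
from collections import deque

class ReplayBuffer:
    """Keeps the latest N experience tuples (state, action, response time,
    next state) so that measured rewards can be reused during training
    instead of re-executing query plans on the DBMS."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)   # oldest entries drop out first
        self.reward_cache = {}                 # (state, action) -> response time

    def add(self, state, action, response_time, next_state):
        # States and actions must be hashable (e.g. tuples) for the cache key.
        self.buffer.append((state, action, response_time, next_state))
        self.reward_cache[(state, action)] = response_time

    def cached_response_time(self, state, action):
        # None means the plan still has to be executed on the DBMS to measure
        # its response time; otherwise the stored measurement is reused.
        return self.reward_cache.get((state, action))

    def sample(self, batch_size):
        # Random mini-batch of past experience used to update the DQN.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

With 1000 cached entries whose plans average 30 seconds of execution each, re-measuring them once more would otherwise cost about 1000 × 30 s, i.e. roughly 8 hours, which is the saving described above.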

VI. CONCLUSION
The process of finding the optimal join order is a complex problem: the number of possible join combinations that the optimizer needs to explore increases exponentially with the number of tables, which makes it impossible for the optimizer to analyze the costs of all possible combinations. Consequently, most optimizers use heuristic rules to prune the search space, which helps the optimizer balance optimization time against plan quality. Besides, the optimizer uses pre-calculated statistics to estimate the cost of a plan, and these estimates may mislead it into choosing an inefficient plan. The planning time of traditional models also increases as the number of join conditions increases.
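For a sense of this scale, the short sketch below computes two standard textbook counts of the join-order search space; the formulas (n! left-deep orders and (2(n-1))!/(n-1)! bushy join trees) are general combinatorial results, not figures reported in this paper.

```python
from math import factorial

def left_deep_orders(n_tables):
    # Left-deep plans: every permutation of the n tables, i.e. n! orders.
    return factorial(n_tables)

def bushy_trees(n_tables):
    # Bushy plans: the classic count of join trees over n relations,
    # (2(n - 1))! / (n - 1)!.
    return factorial(2 * (n_tables - 1)) // factorial(n_tables - 1)

for n in (4, 8, 12):
    print(n, left_deep_orders(n), bushy_trees(n))
# 4 tables yield 24 left-deep orders; 12 tables already yield
# 479,001,600 left-deep orders and about 2.8e13 bushy trees.
```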
In this paper, we presented A Reinforcement Learning Based Query Optimizer (RL_QOptimizer), a new approach for recommending a query execution plan that focuses on the join ordering problem using reinforcement learning based models. Since query execution time is the most crucial requirement of any query engine, execution time was used as the reward for the proposed reinforcement learning models. The results of the performance evaluation show that RL_QOptimizer outperforms PostgreSQL in choosing the best join orders for many queries and show how the learning-based models can save query planning time. Also, the deep reinforcement learning model achieves results very close to those of the Q-learning model, with the key advantage that it is more suitable for large numbers of states and actions because it uses a neural network to approximate the query execution time. In addition, the experiments show the ability of the DQN model to generalize to query states that were never seen during the training phase, which is another key advantage of the deep reinforcement learning model.

The current models have a limited scope in that they can only handle join-only queries with no extra predicates. The study focuses on join ordering only because it is one of the most difficult problems in query optimization [2], and it aims to demonstrate the concept of using reinforcement learning in the query optimizer to replace the cost model by focusing on execution time, which will pave the way for further research toward a comprehensive end-to-end reinforcement learning optimizer.

REFERENCES
[1] R. Elmasri and S. B. Navathe, Fundamentals of Database Systems. San Francisco, CA, USA: Benjamin-Cummings Publishing, 2001.
[2] B. Nevarez, Inside the SQL Server Query Optimizer, C. Massey, Ed. Salford, U.K.: High Performance SQL Server, Simple Talk, 2010.
[3] S. Chaudhuri, "An overview of query optimization in relational systems," in Proc. 17th ACM SIGACT-SIGMOD-SIGART Symp. Princ. Database Syst. (PODS), 1998, pp. 34–43.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," in Proc. Neural Inf. Process. Syst. Workshop, 2013, pp. 1–9.
[5] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992.
[6] S. Vellevt, "Review of algorithms for join ordering problem in database query optimization," Inf. Technol. Control, vol. 1, pp. 1312–2622, Jan. 2009.
[7] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1998.
[8] R. Bellman, "A Markovian decision process," Indiana Univ. Math. J., vol. 6, no. 4, pp. 679–684, Apr. 1957.
[9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. V. D. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[10] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, "Deep reinforcement learning framework for autonomous driving," Electron. Imag., vol. 29, no. 19, pp. 70–76, Jan. 2017.
[11] V. Talpaert, I. Sobh, B. Kiran, P. Mannion, S. Yogamani, A. El-Sallab, and P. Perez, "Exploring applications of deep reinforcement learning for real-world autonomous driving systems," in Proc. 14th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory Appl., 2019, pp. 564–572.
[12] S. Ibrahim and D. Nevin, "End-to-end framework for fast learning asynchronous agents," in Proc. 32nd Conf. Neural Inf. Process. Syst., Imitation Learn. Challenges Robot. Workshop (NIPS), 2018. [Online]. Available: https://sites.google.com/view/nips18-ilr#h.p_6wGpM-tJnQIU
[13] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, "Access path selection in a relational database management system," in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), 1979, pp. 23–34.
[14] P. M. G. Apers, A. R. Hevner, and S. B. Yao, "Optimization algorithms for distributed queries," IEEE Trans. Softw. Eng., vol. SE-9, no. 1, pp. 57–68, Jan. 1983.
[15] N. Bruno and S. Chaudhuri, "Exploiting statistics on query expressions for optimization," in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), 2002, pp. 263–274.
[16] S. Christodoulakis, "Estimating selectivities in data bases," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. CSRG-136, 1982.
[17] Y. E. Ioannidis, R. T. Ng, K. Shim, and T. K. Sellis, "Parametric query optimization," VLDB J. Int. J. Very Large Data Bases, vol. 6, no. 2, pp. 132–151, May 1997.
[18] R. L. Cole and G. Graefe, "Optimization of dynamic query evaluation plans," in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), 1994, pp. 150–160.
[19] S. Ganguly, "Design and analysis of parametric query optimization algorithms," in Proc. 24th Int. Conf. Very Large Data Bases, New York, NY, USA, Aug. 1998, pp. 228–238.
[20] A. Hulgeri and S. Sudarshan, "Parametric query optimization for linear and piecewise linear cost functions," in Proc. 28th Int. Conf. Very Large Data Bases, Aug. 2002, pp. 167–178.
[21] P. Bizarro, N. Bruno, and D. J. DeWitt, "Progressive parametric query optimization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 4, pp. 582–594, Apr. 2009.
[22] N. Reddy and J. R. Haritsa, "Analyzing plan diagrams of database query optimizers," in Proc. VLDB Endowment, Aug. 2005, pp. 1228–1239.
[23] B. J. Oommen, "The efficiency of histogram-like techniques for database query optimization," Comput. J., vol. 45, no. 5, pp. 494–510, May 2002.
[24] V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita, "Improved histograms for selectivity estimation of range predicates," in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), 1996, pp. 294–305.
[25] X. Lu and J. Guan, "A new approach to building histogram for selectivity estimation in query processing optimization," Comput. Math. Appl., vol. 57, no. 6, pp. 1037–1047, Mar. 2009.
[26] J. Zahir and A. E. Qadi, "A recommendation system for execution plans using machine learning," Math. Comput. Appl., vol. 21, no. 23, p. 23, Jun. 2016.
[27] M. Stillger, G. M. Lohman, V. Markl, and M. Kandil, "LEO-DB2's learning optimizer," in Proc. 27th Int. Conf. Very Large Data Bases, San Francisco, CA, USA: Morgan Kaufmann, Sep. 2001, pp. 19–28.
[28] H. Liu, M. Xu, Z. Yu, V. Corvinelli, and C. Zuzarte, "Cardinality estimation using neural networks," in Proc. 25th Annu. Int. Conf. Comput. Sci. Softw. Eng., Riverton, NJ, USA: IBM Corp., Nov. 2015, pp. 53–59.
[29] R. Marcus and O. Papaemmanouil, "Deep reinforcement learning for join order enumeration," in Proc. 1st Int. Workshop Exploiting Artif. Intell. Techn. Data Manage., Jun. 2018, pp. 1–4.
[30] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi, "Learning state representations for query optimization with deep reinforcement learning," in Proc. Workshop Data Manage. End-to-End Mach. Learn., no. 4, Jun. 2018, pp. 1–4.
[31] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul, "Neo: A learned query optimizer," Proc. VLDB Endowment, vol. 12, no. 11, pp. 1705–1718, Jul. 2019.
[32] O. Ivanov and S. Bartunov, "Adaptive cardinality estimation," 2017, arXiv:1711.08330.

[33] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica, "Learning to optimize join queries with deep reinforcement learning," 2018, arXiv:1808.03196.
[34] K. Tzoumas, T. Sellis, and C. S. Jensen, "A reinforcement learning approach for adaptive query processing," Inst. Datalogi, Aalborg Universitet, Aalborg, Denmark, DB Tech. Rep. 22, 2008. [Online]. Available: https://vbn.aau.dk/en/publications/a-reinforcement-learning-approach-for-adaptive-query-processing
[35] R. B. Guo and K. Daudjee, "Research challenges in deep reinforcement learning-based join query optimization," in Proc. 3rd Int. Workshop Exploiting Artif. Intell. Techn. Data Manage., Jun. 2020, pp. 1–6.
[36] R. Marcus and O. Papaemmanouil, "Towards a hands-free query optimizer through deep learning," in Proc. 9th Biennial Conf. Innov. Data Syst. Res. (CIDR), 2019, pp. 1–8.
[37] I. Trummer, J. Wang, D. Maram, S. Moseley, S. Jo, and J. Antonakakis, "SkinnerDB: Regret-bounded query evaluation via reinforcement learning," in Proc. Int. Conf. Manage. Data, Jun. 2019, pp. 1153–1170.
[38] J. Heitz and K. Stockinger, "Join query optimization with deep reinforcement learning algorithms," 2019, arXiv:1911.11689.
[39] R. Marcus, P. Negi, H. Mao, N. Tatbul, M. Alizadeh, and T. Kraska, "Bao: Making learned query optimization practical," in Proc. Int. Conf. Manage. Data, Jun. 2021, pp. 1275–1288.
[40] A. Sharma, F. M. Schuhknecht, and J. Dittrich, "The case for automatic database administration using deep reinforcement learning," 2018, arXiv:1801.05643.
[41] J. Wang, I. Trummer, and D. Basu, "UDO: Universal database optimization using reinforcement learning," Proc. VLDB Endowment, vol. 14, no. 13, pp. 3402–3414, Sep. 2021.
[42] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, Feb. 2015.
[43] (1996). PostgreSQL. [Online]. Available: https://www.postgresql.org/
[44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015, pp. 1–15.
[45] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann, "How good are query optimizers, really?" Proc. VLDB Endowment, vol. 9, no. 3, pp. 204–215, Nov. 2015.
[46] (1993). TPC-H. [Online]. Available: http://www.tpc.org/tpch/
[47] (2015). TensorFlow. [Online]. Available: https://www.tensorflow.org
[48] (2015). Keras. [Online]. Available: https://keras.io/
[49] (1990). IMDb. [Online]. Available: https://www.imdb.com/
[50] M. Dreseler, M. Boissier, T. Rabl, and M. Uflacker, "Quantifying TPC-H choke points and their optimizations," Proc. VLDB Endowment, vol. 13, no. 8, pp. 1206–1220, Apr. 2020.

MOHAMED RAMADAN received the B.Sc. degree from the Information Systems Department, Faculty of Computers and Artificial Intelligence, Cairo University, in 2016. He is currently a Teaching Assistant and a Researcher at the Faculty of Computers and Artificial Intelligence, Cairo University. He has six years of experience in the area of software development.

AYMAN EL-KILANY received the M.Sc. and Ph.D. degrees from the Information Systems Department, Faculty of Computers and Artificial Intelligence, Cairo University, in 2012 and 2018, respectively. He is currently an Assistant Professor and a Researcher at the Faculty of Computers and Artificial Intelligence, Cairo University.

HODA M. O. MOKHTAR received the B.Sc. (Hons.) and M.Sc. degrees from the Department of Computer Engineering, Faculty of Engineering, Cairo University, in 1997 and 2000, respectively, and the Ph.D. degree in computer science from the University of California at Santa Barbara, in 2005. She is currently the Dean of the Faculty of Computing and Information Sciences, Egypt University of Informatics. Before being the Dean, she was the Chair of the Information Systems Department, Faculty of Computers and Artificial Intelligence, Cairo University. In 2000, she was awarded a scholarship and the Dean's Fellowship from the Computer Science Department, UCSB. She taught multiple courses at both the undergraduate and graduate levels at the Faculty of Computers and Artificial Intelligence, Cairo University, where she has supervised a number of master's and Ph.D. theses. She has participated in several national committees and has been awarded multiple awards and certificates for her academic achievements. Her research interests include big data analytics, data warehousing, data mining, database systems, social network analysis, bioinformatics, and web services.

IBRAHIM SOBH received the B.Sc. and M.Sc. degrees in computer engineering from the Faculty of Engineering, Cairo University, and the Ph.D. degree in deep reinforcement learning for fast learning agents acting in 3D environments. Currently, he is a Senior Expert of AI at Valeo. He has more than 20 years of experience in the area of machine learning and software development. His M.Sc. thesis is in the field of machine learning applied to automatic document summarization. He has participated in several related national and international mega projects, conferences, and summits. He delivers training and lectures for academic and industrial entities. His publications, including international journal and conference papers, are mainly in the machine and deep learning fields. His research interests include computer vision, natural language processing, and speech processing.
