Hive Vectorized Query Execution Design
The Hive query execution engine currently processes one row at a time. A single row
of data goes through all the operators before the next row can be processed. This mode
of processing is very inefficient in terms of CPU usage. Research has demonstrated
that row-at-a-time processing yields high instruction counts and poor processor
pipeline utilization (low instructions per cycle) [Boncz 2005]. In addition, Hive
currently relies heavily on lazy deserialization, and data columns go through a layer
of object inspectors that identify the column type, deserialize the data, and
determine the appropriate expression routines in the inner loop. These layers of
virtual method calls further slow down the processing.
The Hive vectorization project aims to remove the above-mentioned inefficiencies
by modifying the execution engine to work on vectors of columns. The data rows
will be batched together and represented as a set of column vectors and the
processor inner loop will work on one vector at a time. This approach has been
shown to be cache friendly and to achieve high instructions per cycle through
effective use of pipelining in modern superscalar CPUs.
This project will also closely integrate the Hive execution engine with the ORC file
format removing the layers of deserialization and type inference. The cost of eager
deserialization is expected to be small because filter push down will be supported
and the vectorized representation of data will consist of arrays of primitive types as
much as possible. We will design a new generic vectorized iterator interface to get a
batch of rows at a time, with one column vector per column, from the storage layer.
This will be implemented first for ORC, but we envision that it will later be
implemented for other storage formats like text, sequence file, and Trevni. This will
enable the CPU-saving benefits of vectorized query execution on these other formats
as well.
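As a rough illustration of what such an iterator interface could look like, the sketch below fills a caller-supplied batch from a toy in-memory source. The names VectorizedBatchReader, nextBatch and InMemoryLongReader are hypothetical and are not taken from Hive or ORC. The sketch also assumes that every column is of type long and that the batch capacity equals the length of the selected array.

```java
// Minimal stand-ins for the batch data structures (illustration only).
abstract class ColumnVector {}

class LongColumnVector extends ColumnVector {
    long[] vector;
    LongColumnVector(int capacity) { vector = new long[capacity]; }
}

class VectorizedRowBatch {
    boolean selectedInUse;
    int[] selected;
    int size;
    ColumnVector[] columns;

    VectorizedRowBatch(int numColumns, int capacity) {
        selected = new int[capacity];
        columns = new ColumnVector[numColumns];
        for (int c = 0; c < numColumns; c++) {
            columns[c] = new LongColumnVector(capacity);
        }
    }
}

// Hypothetical storage-independent interface: fill the given batch and
// return false once no more rows are available.
interface VectorizedBatchReader {
    boolean nextBatch(VectorizedRowBatch batch);
}

// Toy implementation over column-major data, standing in for a real format.
class InMemoryLongReader implements VectorizedBatchReader {
    private final long[][] data; // data[column][row]
    private final int rowCount;
    private int nextRow = 0;

    InMemoryLongReader(long[][] data) {
        this.data = data;
        this.rowCount = data.length == 0 ? 0 : data[0].length;
    }

    @Override
    public boolean nextBatch(VectorizedRowBatch batch) {
        int capacity = batch.selected.length;
        int n = Math.min(capacity, rowCount - nextRow);
        if (n <= 0) {
            batch.size = 0;
            return false;
        }
        // Copy one contiguous slice per column into the batch.
        for (int c = 0; c < data.length; c++) {
            long[] out = ((LongColumnVector) batch.columns[c]).vector;
            System.arraycopy(data[c], nextRow, out, 0, n);
        }
        batch.selectedInUse = false; // a freshly read batch has no filtered rows
        batch.size = n;
        nextRow += n;
        return true;
    }
}
```

Reusing one batch object across calls avoids allocating new arrays for every batch, which is one plausible reason to have the caller supply the batch.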
The following sections describe the scope of changes and key design choices in this
project.
1. Basic Idea
The basic idea is to process a batch of rows as an array of column vectors. A
basic representation of a vectorized batch of rows is as follows:
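class VectorizedRowBatch {
  boolean selectedInUse;   // true if only the rows listed in selected are valid
  int [] selected;         // indexes of the valid rows
  int size;                // number of valid rows
  ColumnVector [] columns;
}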
The selected array contains the indexes of the valid rows in the batch and
size represents the number of valid rows. An example of a ColumnVector is
as follows:
class LongColumnVector extends ColumnVector {
  long [] vector;
}
The addition of a constant to a long column can then be implemented as
follows:
class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    long [] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long [] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the rows listed in the selection vector are valid.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // All of the first batch.size rows are valid.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}
4. Boolean/Filter expressions
The vectorized row batch object will consist of an array of column vectors
and an additional selection vector that identifies the rows in the batch that
have not been filtered out (refer to the data structure). Filter operators
can be implemented very efficiently by updating the selection vector so that
it holds only the positions of the selected rows. This approach doesn’t
require any additional output column. It also gives the benefit of
short-circuit evaluation of conditions: for example, a row that has already
been filtered out need not be processed by any subsequent expression.
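The in-place update of the selection vector can be sketched as follows. This is a minimal illustration rather than the project's actual code: the class name FilterLongColumnGreaterThanScalar is hypothetical, and the sketch assumes that the selected array is always allocated with the full batch capacity.

```java
// Minimal stand-ins for the batch data structures (illustration only).
abstract class ColumnVector {}

class LongColumnVector extends ColumnVector {
    long[] vector;
    LongColumnVector(long... values) { vector = values; }
}

class VectorizedRowBatch {
    boolean selectedInUse;
    int[] selected;
    int size;
    ColumnVector[] columns;
}

// Hypothetical filter expression: keeps only the rows where column > scalar.
class FilterLongColumnGreaterThanScalar {
    final int inputColumn;
    final long scalar;

    FilterLongColumnGreaterThanScalar(int inputColumn, long scalar) {
        this.inputColumn = inputColumn;
        this.scalar = scalar;
    }

    void evaluate(VectorizedRowBatch batch) {
        long[] vector = ((LongColumnVector) batch.columns[inputColumn]).vector;
        int[] sel = batch.selected;
        int newSize = 0;
        if (batch.selectedInUse) {
            // Compact the existing selection in place. Writes never overtake
            // reads because newSize <= j at every step.
            for (int j = 0; j < batch.size; j++) {
                int i = sel[j];
                if (vector[i] > scalar) {
                    sel[newSize++] = i;
                }
            }
        } else {
            // First filter on this batch: record the surviving row indexes.
            for (int i = 0; i < batch.size; i++) {
                if (vector[i] > scalar) {
                    sel[newSize++] = i;
                }
            }
            // Switch to the selection vector only if a row was dropped.
            if (newSize < batch.size) {
                batch.selectedInUse = true;
            }
        }
        batch.size = newSize;
    }
}
```

Note that the filter writes no output column. A later expression that honors selectedInUse visits only the surviving rows.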
The ‘and’ expression in a filter is very simple. It will always have two child
expressions, each evaluating to Boolean. A filter ‘and’ can be implemented as
a sequential evaluation of its two child expressions, where each is a filter in
itself. There is no logic needed for the ‘and’ expression itself apart from
evaluating its children.
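A minimal sketch of this idea is shown below. The names VectorFilter, LongColumnFilter and FilterAnd are hypothetical. The child filter takes a LongPredicate only to keep the sketch short; a real vectorized expression would presumably inline the comparison to avoid a call per row.

```java
import java.util.function.LongPredicate;

// Minimal stand-ins for the batch data structures (illustration only).
abstract class ColumnVector {}

class LongColumnVector extends ColumnVector {
    long[] vector;
    LongColumnVector(long... values) { vector = values; }
}

class VectorizedRowBatch {
    boolean selectedInUse;
    int[] selected;
    int size;
    ColumnVector[] columns;
}

// Hypothetical common interface for in-place filter expressions.
interface VectorFilter {
    void evaluate(VectorizedRowBatch batch);
}

// Generic child filter on one long column; narrows the selection in place.
class LongColumnFilter implements VectorFilter {
    private final int column;
    private final LongPredicate keep;

    LongColumnFilter(int column, LongPredicate keep) {
        this.column = column;
        this.keep = keep;
    }

    @Override
    public void evaluate(VectorizedRowBatch batch) {
        long[] vector = ((LongColumnVector) batch.columns[column]).vector;
        int newSize = 0;
        for (int j = 0; j < batch.size; j++) {
            int i = batch.selectedInUse ? batch.selected[j] : j;
            if (keep.test(vector[i])) {
                batch.selected[newSize++] = i;
            }
        }
        batch.selectedInUse = true;
        batch.size = newSize;
    }
}

// 'and' has no logic of its own: it evaluates its two children in sequence.
class FilterAnd implements VectorFilter {
    private final VectorFilter left;
    private final VectorFilter right;

    FilterAnd(VectorFilter left, VectorFilter right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public void evaluate(VectorizedRowBatch batch) {
        left.evaluate(batch);  // narrows the selection vector
        right.evaluate(batch); // only sees rows that passed the left child
    }
}
```

Because the right child iterates over the selection left behind by the left child, rows rejected by the first condition are never tested against the second one.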
The expression b+c+d will be identified, say as Expr0, at compile time, and
the vectorized row batch object will have in it the columns (a, b, c, d, Expr0).
8. Null Handling
Each column vector will also contain an array of Booleans that marks the
rows with null values for that column. The expression code will also
have to handle the null values in the columns. There are many useful
optimizations possible for nulls.
If it is known that a column doesn’t contain any null values, we can avoid
a null check in the inner loop.
Consider a binary operation between column c1 and column c2 in which
the output is null if any operand is null. If it is known that c2 is never
null, then the null vector of the output column will be the same as the null
vector of the input column c1. In such a situation, it is possible to just
copy the null vector from c1 to the output and skip the null check in the
inner loop. It may also be possible to get away with just a shallow
copy of the null vector.
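The two optimizations above can be sketched as follows. The field names isNull and noNulls and the class LongColumnAddLongColumn are assumptions made for this illustration, and the selection vector is ignored to keep the example short.

```java
// Column vector with a null vector and a "no nulls" flag (names are assumed).
class LongColumnVector {
    long[] vector;
    boolean[] isNull; // isNull[i] is true when row i is null
    boolean noNulls;  // true when the column is known to contain no nulls

    LongColumnVector(int capacity) {
        vector = new long[capacity];
        isNull = new boolean[capacity];
        noNulls = true;
    }
}

// Hypothetical expression: out = c1 + c2, null if either operand is null.
class LongColumnAddLongColumn {
    static void evaluate(LongColumnVector c1, LongColumnVector c2,
                         LongColumnVector out, int size) {
        // Step 1: derive the output null vector outside the value loop.
        if (c1.noNulls && c2.noNulls) {
            out.noNulls = true; // nothing to copy or check
        } else if (c2.noNulls) {
            // Output nulls are exactly the nulls of c1.
            System.arraycopy(c1.isNull, 0, out.isNull, 0, size);
            out.noNulls = false;
        } else if (c1.noNulls) {
            // Symmetric case: output nulls are exactly the nulls of c2.
            System.arraycopy(c2.isNull, 0, out.isNull, 0, size);
            out.noNulls = false;
        } else {
            for (int i = 0; i < size; i++) {
                out.isNull[i] = c1.isNull[i] || c2.isNull[i];
            }
            out.noNulls = false;
        }
        // Step 2: the value loop contains no null check. Values at null
        // positions are computed but meaningless; long addition cannot
        // throw in Java, so this is harmless.
        for (int i = 0; i < size; i++) {
            out.vector[i] = c1.vector[i] + c2.vector[i];
        }
    }
}
```

A shallow copy would replace the System.arraycopy call with out.isNull = c1.isNull, which is safe only if no later expression modifies the shared array.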
9. Vectorized operators
We will implement vectorized versions of the current set of operators. The
vectorized operators will contain vectorized expressions; they will take a
vectorized row batch as input and will not use object inspectors to access
columns.
a. Filter Operator:
The filter operator will consist of just the filter condition expression,
which will be an in-place filter expression. Therefore, once this
condition is evaluated the data would already be filtered and will be
passed to the next operator.
b. Select Operator:
The select operator will consist of a list of expressions that need to be
projected. The operator will be initialized with the index of the output
column for each child expression. The vectorized input will be passed
to each expression in the list, which will populate the output columns
appropriately. The select operator will output a different vectorized
row batch object, which will consist of the projected columns. The
projected columns would just refer to the appropriate output columns in
the initial vectorized row batch object. The select operator will have a
pre-allocated vectorized row batch object that will be used as output.
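One way to picture the select operator described above is the following sketch. The names VectorExpression, VectorSelectOperator and process are hypothetical, and the sketch assumes that the input batch already contains a scratch column for each expression output.

```java
// Minimal stand-ins for the batch data structures (illustration only).
abstract class ColumnVector {}

class LongColumnVector extends ColumnVector {
    long[] vector;
    LongColumnVector(long... values) { vector = values; }
}

class VectorizedRowBatch {
    boolean selectedInUse;
    int[] selected;
    int size;
    ColumnVector[] columns;
}

// Hypothetical interface for expressions that write into an output column.
interface VectorExpression {
    void evaluate(VectorizedRowBatch batch);
}

class LongColumnAddLongScalar implements VectorExpression {
    private final int inputColumn;
    private final int outputColumn;
    private final long scalar;

    LongColumnAddLongScalar(int inputColumn, int outputColumn, long scalar) {
        this.inputColumn = inputColumn;
        this.outputColumn = outputColumn;
        this.scalar = scalar;
    }

    @Override
    public void evaluate(VectorizedRowBatch batch) {
        long[] in = ((LongColumnVector) batch.columns[inputColumn]).vector;
        long[] out = ((LongColumnVector) batch.columns[outputColumn]).vector;
        for (int j = 0; j < batch.size; j++) {
            int i = batch.selectedInUse ? batch.selected[j] : j;
            out[i] = in[i] + scalar;
        }
    }
}

class VectorSelectOperator {
    private final VectorExpression[] expressions;
    private final int[] projectedColumns; // indexes into the input batch
    private final VectorizedRowBatch output = new VectorizedRowBatch(); // pre-allocated

    VectorSelectOperator(VectorExpression[] expressions, int[] projectedColumns) {
        this.expressions = expressions;
        this.projectedColumns = projectedColumns;
        this.output.columns = new ColumnVector[projectedColumns.length];
    }

    VectorizedRowBatch process(VectorizedRowBatch input) {
        // Each expression populates its own output column in the input batch.
        for (VectorExpression expression : expressions) {
            expression.evaluate(input);
        }
        // The output batch only refers to the projected columns; no data is copied.
        for (int k = 0; k < projectedColumns.length; k++) {
            output.columns[k] = input.columns[projectedColumns[k]];
        }
        output.size = input.size;
        output.selected = input.selected;
        output.selectedInUse = input.selectedInUse;
        return output;
    }
}
```

Since the output batch shares column vectors and the selection vector with the input, projection costs a few reference assignments per batch rather than a copy per row.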
References
[Boncz 2005] Peter Boncz et al., MonetDB/X100: Hyper-Pipelining Query Execution,
Proceedings of the 2005 CIDR Conference.