The following is documentation for the Reduction kernel within portBLAS.
Reduction is not a standard BLAS operation. It is provided as an extension in
portBLAS. It refers to an operation where some or all elements of a matrix
are reduced to a single scalar value via a binary operator. The supported
operators are:
AddOperatorAbsoluteAddOperatorProductOperatorMinOperatorMaxOperatorMeanOperator
Currently in portBLAS only partial reduction is supported. Unlike full
reduction, it reduces the columns or rows of a matrix to a single row or column
depending on the reduction dimension specified. The reduction dimension is
specified with the enumeration reduction_dim_t.
Below are examples of the output of partial reduction using AddOperator with
the input matrix
1 4 7
2 5 8
3 6 9
reduction_dim_t::inner- Computes partial reduction along the columns, i.e. the sum of each column. The output in this case is a row vector:
6 15 24
reduction_dim_t::outer- Computes partial reduction along the rows, i.e. the sum of each row of the input matrix. The output in this case is a column vector:
12
15
18
The CMake option BLAS_ENABLE_EXTENSIONS (ON by default) can be used to
enable/disable compilation of the Reduction operation.
Currently portBLAS supports a partial reduction kernel. Its implementation
can be found in
src/operations/extension/reduction.hpp.
There is a single Reduction kernel: Reduction. The main implementation of the
kernel is in Reduction::eval().
The algorithm is as follows:
- Load multiple values from global memory.
- Reduce them together using the reduction operator (
operator_t) - Store the result in local memory.
- Perform the reduction operation on the element in local memory with current local id and the corresponding element in the second half of local memory.
- Store the result in the appropriate part of the output vector.
Some optimizations that were implemented:
- Loop unrolling
- Using an offset when storing data in local memory to prevent bank conflicts
- Using template metaprogramming to compute as much as possible at compile time to avoid if-statements in the kernel