This paper addresses emulation algorithms for matrix multiplication. General Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware architectures. The Ozaki scheme is a well-established GEMM-based emulation method for matrix multiplication, wherein input matrices are decomposed into several low-precision components to ensure that the resulting matrix product is computed exactly through numerical operations. This study proposes a novel GEMM-based emulation method for matrix multiplication that leverages the Chinese Remainder Theorem. The proposed method inherits the computational efficiency of highly optimized GEMM routines and further enables control over the number of matrix multiplications, which can enhance computational accuracy. We present numerical experiments featuring INT8 Tensor Core operations on GPUs and FP64 arithmetic on CPUs as case studies. The results demonstrate that FP64 emulation using the proposed method achieves performance levels of up to 7.4 to 9.8 TFLOPS on the NVIDIA RTX 4090 and 56.6 to 80.2 TFLOPS on the NVIDIA GH200, exceeding the measured performance of native FP64 arithmetic. Furthermore, for FP64 computations on CPUs, the proposed method achieved up to a 2.3x speedup in emulating quadruple-precision arithmetic compared to the conventional Ozaki scheme.
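The Chinese Remainder Theorem idea behind such a method can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes integer input matrices (the scaling of FP64 inputs to integers is omitted), a hypothetical function name `crt_matmul`, and pairwise coprime moduli whose product exceeds twice the largest absolute entry of the true product while still fitting comfortably in 64-bit integers. Each modular product involves only small residues, which is what would make low-precision GEMM units applicable; here plain NumPy integer arithmetic stands in for them.

```python
import numpy as np
from math import prod

def crt_matmul(A, B, moduli):
    """Exact integer A @ B, rebuilt from one small-residue product per modulus.

    Assumptions (illustrative only): the moduli are pairwise coprime,
    their product M exceeds 2 * max|A @ B|, and all intermediate values
    fit in int64.
    """
    M = prod(moduli)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for m in moduli:
        # Residues lie in [0, m), so this product needs few bits per entry.
        Cm = (np.mod(A, m) @ np.mod(B, m)) % m
        Mi = M // m
        # CRT basis element: congruent to 1 mod m and 0 mod every other modulus.
        weight = Mi * pow(Mi, -1, m)
        C += Cm * weight
    C %= M
    # Map [0, M) to the symmetric range so that negative entries are recovered.
    return np.where(C > M // 2, C - M, C)

rng = np.random.default_rng(0)
A = rng.integers(-100, 101, size=(8, 8))
B = rng.integers(-100, 101, size=(8, 8))
C = crt_matmul(A, B, [251, 241, 239, 233])
```

In this sketch the number of moduli equals the number of matrix multiplications, so adding moduli widens the range of products that can be represented exactly at the cost of extra multiplications.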
Wave equations help us to understand phenomena ranging from earthquakes to tsunamis. These phenomena materialise over very large scales, so it would be computationally infeasible to track them over a regular mesh. Yet, since the phenomena are localised, adaptive mesh refinement (AMR) can be used to construct meshes with a higher resolution close to the regions of interest. ExaHyPE is a software engine created to solve wave problems using AMR, and we use it as a baseline to construct our numerical relativity application called ExaGRyPE. To advance the mesh in time, we have to interpolate and restrict along resolution transitions in every time step. ExaHyPE's vanilla code version uses a d-linear tensor-product approach. In benchmarks of a stationary black hole, this approach performs slowly and leads to errors in conserved quantities near AMR boundaries. We therefore introduce a set of higher-order interpolation schemes in which the derivatives are calculated at each coarse grid cell to approximate the enclosed fine cells. The resulting methods run faster than the tensor-product approach. Most importantly, when running the stationary black hole simulation with the higher-order methods, the errors near the AMR boundaries are removed.
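The idea of using derivatives computed per coarse cell to fill the enclosed fine cells can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the ExaGRyPE scheme: the function name `prolong_taylor`, the refinement ratio of 3, the finite-difference stencils from `np.gradient`, and the second-order Taylor expansion of point values at cell centres are all choices made here for brevity.

```python
import numpy as np

def prolong_taylor(u_coarse, H, ratio=3):
    """Interpolate 1D coarse cell-centre values onto `ratio` fine cells each.

    Illustrative assumptions: uniform coarse spacing H, point values at
    cell centres, derivatives estimated with finite differences.
    """
    # First and second derivative estimates, one per coarse cell.
    du = np.gradient(u_coarse, H, edge_order=2)
    d2u = np.gradient(du, H, edge_order=2)
    # Offsets of the fine cell centres relative to the coarse cell centre.
    offsets = (np.arange(ratio) + 0.5) / ratio * H - 0.5 * H
    # Second-order Taylor expansion around each coarse cell centre.
    fine = (u_coarse[:, None]
            + du[:, None] * offsets[None, :]
            + 0.5 * d2u[:, None] * offsets[None, :] ** 2)
    # Flatten so the fine cells of each coarse cell are stored contiguously.
    return fine.reshape(-1)

H = 0.5
x_coarse = (np.arange(6) + 0.5) * H
u_fine = prolong_taylor(x_coarse**2 - 3.0 * x_coarse + 1.0, H)
```

Because the expansion includes the second derivative, this sketch reproduces quadratic profiles exactly, whereas a linear interpolation would not.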