Thomas Faignaert, Tim Besard, Bjorn De Sutter

General Matrix Multiplication or GEMM kernels take center place in high performance
computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as
NVIDIA’s Tensor Cores. In this paper we show how it is possible to program these
accelerators from Julia, and present abstractions and interfaces that allow to do so
efficiently without sacrificing performance.

A pre-print of the paper has been published on arXiv:
arXiv:2009.12263.

The source code can be found on
GitHub:
thomasfaingnaert/GemmKernels.jl.

With the APIs from GemmKernels.jl, it is possible to instantiate GEMM kernels that perform
in the same ball park as, and sometimes even outperform state-of-the-art libraries like
CUBLAS and CUTLASS. For example, performing a mixed-precision multiplication of two 16-bit
matrixes into a 32-bit accumulator (on different combinations of layouts):

Performance of mixed-precision GEMM

The APIs are also highly flexible and allow customization of each step, e.g., to apply the
activation function max(x, 0) for implementing a rectified linear unit (ReLU):

a = CuArray(rand(Float16, (M, K)))
b = CuArray(rand(Float16, (K, N)))
c = CuArray(rand(Float32, (M, N)))
d = similar(c)

conf = GemmKernels.get_config(
    gemm_shape = (M = M, N = N, K = K),
    operator = Operator.WMMAOp{16, 16, 16},
    global_a_layout = Layout.AlignedColMajor{Float16},
    global_c_layout = Layout.AlignedColMajor{Float32})

GemmKernels.matmul(
    a, b, c, d, conf;
    transform_regs_to_shared_d = Transform.Elementwise(x -> max(x, 0)))

The GemmKernels.jl framework is written entirely in Julia, demonstrating the
high-performance GPU programming capabilities of this language, but at the same time keeping
the research accessible and easy to modify or repurpose by other Julia developers.

Read More

ترك الرد

من فضلك ادخل تعليقك
من فضلك ادخل اسمك هنا