Walkthrough: Code Generation for GPUs#
This document gives an overview of the facilities the pystencils code generation backend provides for generating GPU kernels for the CUDA and HIP platforms. It complements the previous Walkthrough: The Kernel Translation Process; if you haven’t read that yet, we advise you to take a look at it first.
Preparation: Setup and Parsing#
We’ll be using the same kernel as in the walkthrough for demonstration:
[Assignment(x, f_C), Assignment(f_C, c*x), MaxReductionAssignment(wmax, x)]
The setup and parsing steps are the same as in Context Setup and Parsing of the Kernel Body, and we arrive at the same IR for the kernel body:
{
x: float64 = _data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]];
_data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]] = c * x;
wmax_local = max(wmax_local, x);
}
Iteration Spaces via GPU Block/Thread Indexing#
On GPU, as on CPU, we continue by manifesting the kernel’s index space as an axes cube via the AST factory, and canonicalizing it:
cube = factory.cube_from_ispace(ispace, body)
from pystencils.backend.transformations import CanonicalizeSymbols
canonicalize = CanonicalizeSymbols(ctx)
cube = canonicalize(cube)
ps.inspect(cube)
axes-cube(
range(ctr_2 : [[0: int64] : _size_f_2 - [0: int64] : [1: int64]]),
range(ctr_1 : [[0: int64] : _size_f_1 - [0: int64] : [1: int64]]),
range(ctr_0 : [[0: int64] : _size_f_0 - [0: int64] : [1: int64]])
)
{
x: const float64 = _data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]];
_data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]] = c * x;
wmax_local = max(wmax_local, x);
}
Now, instead of expanding this cube into a loop nest, we use GPU block and thread axes. Three axis types are available:

- gpu_block maps the cube’s leading axis onto the current GPU block index (blockIdx in CUDA and HIP);
- gpu_thread parallelizes the next axis using the current GPU thread index (threadIdx in CUDA and HIP); and
- gpu_block_x_thread maps the leading axis onto the typical combination blockIdx * blockDim + threadIdx.
These axes may be freely combined with target-agnostic axes, like loop,
to construct a wide range of GPU iteration strategies.
Here’s a sample expansion strategy using all three axis types:
from pystencils.backend.transformations import AxisExpansion
ae = AxisExpansion(ctx)
strategy = ae.create_strategy(
[
ae.gpu_block("y"),
ae.gpu_thread("y"),
ae.gpu_block_x_thread("x")
]
)
kernel_ast = strategy(cube)
ps.inspect(kernel_ast)
{
gpu-block-axis< Y >(range(ctr_2 : [[0: int64] : _size_f_2 - [0: int64] : [1: int64]]))
{
gpu-thread-axis< Y >(range(ctr_1 : [[0: int64] : _size_f_1 - [0: int64] : [1: int64]]))
{
gpu-block-x-thread-axis< X >(range(ctr_0 : [[0: int64] : _size_f_0 - [0: int64] : [1: int64]]))
{
x: const float64 = _data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]];
_data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]] = c * x;
wmax_local = max(wmax_local, x);
}
}
}
}
Next, we invoke MaterializeAxes to turn the GPU iteration axes into actual code:
from pystencils.backend.transformations import MaterializeAxes
mat_axes = MaterializeAxes(ctx)
kernel_ast = mat_axes(kernel_ast)
ps.inspect(kernel_ast)
{
ctr_2: int64 = gpu.blockIdx.Y();
ctr_1: int64 = gpu.threadIdx.Y();
ctr_0: int64 = gpu.blockIdx.X() * gpu.blockDim.X() + gpu.threadIdx.X();
if(ctr_2 < _size_f_2 && ctr_1 < _size_f_1 && ctr_0 < _size_f_0)
{
x: const float64 = _data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]];
_data_f[ctr_0 + [0: int64], ctr_1 + [0: int64], ctr_2 + [0: int64], [0: int64]] = c * x;
wmax_local = max(wmax_local, x);
}
}
The resulting code uses IR functions blockIdx.Y(), threadIdx.Y(), etc., to define
the axis counters. Also, kernel guards are introduced to make sure the kernel body runs only
if the indices are inside the iteration space.
The axes materializer attempts to combine as many directly nested GPU axes as it can find
into a single guard.
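The effect of the gpu_block_x_thread mapping and its guard can be illustrated with a small plain-Python simulation (a hedged sketch, independent of the pystencils API): it enumerates the counter values produced by blockIdx * blockDim + threadIdx and applies the bounds check that the guard performs.

```python
def covered_indices(grid_dim: int, block_dim: int, size: int) -> list[int]:
    """Enumerate counter values for a 1D gpu_block_x_thread axis.

    Mirrors `ctr = blockIdx * blockDim + threadIdx` plus the
    `ctr < size` kernel guard introduced by the axes materializer.
    """
    covered = []
    for block in range(grid_dim):          # blockIdx
        for thread in range(block_dim):    # threadIdx
            ctr = block * block_dim + thread
            if ctr < size:                 # kernel guard
                covered.append(ctr)
    return covered
```

For example, a launch grid of 4 blocks of 32 threads covers an iteration space of size 100 exactly once per index; the guard masks out the 28 excess threads of the last block.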
Reductions to Memory#
At this point, the reduction to wmax needs to be completed by adding its memory-write-back code.
We introduce this via the ReductionsToMemory pass.
The write-back IR function will later be transformed into an optimized GPU reduction implementation
by the SelectFunctions pass.
from pystencils.backend.transformations import ReductionsToMemory
reduce_to_memory = ReductionsToMemory(ctx, ctx.reduction_data.values())
kernel_ast = reduce_to_memory(kernel_ast)
ps.inspect(kernel_ast)
Lowering#
At this point, the kernel AST still contains IR features (GPU index functions and the reduction write-back) that must be mapped onto target-specific implementations. To perform these lowerings, we must first set up a GPU platform object. Let’s use the CUDA platform in this case:
from pystencils.backend.platforms import CudaPlatform
platform = CudaPlatform(
ctx,
assume_warp_aligned_block_size=True,
warp_size=32
)
We have specified two optional parameters here: the GPU’s warp size (which is always 32 for CUDA),
and the assumption that the total size of a thread block at runtime will always be a multiple of the warp size.
These assumptions allow the platform to emit optimized code for the reduction onto wmax, as we will see below.
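The benefit of the warp-alignment assumption can be made concrete with a tiny helper (illustrative only, not part of pystencils): if a block’s total thread count is a multiple of the warp size, the block contains no partially filled warps, so warp-level shuffle operations need no additional lane masking.

```python
from math import prod

def has_partial_warp(block_dim: tuple[int, int, int], warp_size: int = 32) -> bool:
    """True if the block contains a warp that is not fully populated."""
    return prod(block_dim) % warp_size != 0

# (32, 8, 1) -> 256 threads: fully warp-aligned, no masking needed
# (20, 3, 1) -> 60 threads: the second warp holds only 28 active lanes
```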
So, let us run the SelectFunctions and LowerToC passes:
from pystencils.backend.transformations import SelectFunctions, LowerToC
lower_to_c = LowerToC(ctx)
kernel_ast = lower_to_c(kernel_ast)
select_functions = SelectFunctions(platform)
kernel_ast = select_functions(kernel_ast)
ps.inspect(kernel_ast)
Here you can see that the GPU indexing intrinsics were replaced by the corresponding CUDA thread and block index variables. Also, the reduction write-back was materialized using a warp-level reduction construct: First, shuffle instructions are used to perform a tree-reduction within the current warp, such that afterward the warp’s leading thread holds the warp’s reduction result, and only it needs to perform an atomic memory update.
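The shuffle-based tree reduction can be modeled in plain Python (a simulation for illustration, not pystencils code): each step exchanges values between lanes whose indices differ in one bit of the XOR offset, halving the offset until every lane holds the warp-wide maximum.

```python
def warp_reduce_max(lane_values: list[float], warp_size: int = 32) -> list[float]:
    """Simulate the __shfl_xor_sync butterfly reduction over one warp."""
    assert len(lane_values) == warp_size
    vals = list(lane_values)
    offset = warp_size // 2
    while offset > 0:  # offsets 16, 8, 4, 2, 1 for a 32-lane warp
        # lane i reads the value of lane i ^ offset, then keeps the maximum
        vals = [max(vals[i], vals[i ^ offset]) for i in range(warp_size)]
        offset //= 2
    return vals  # every lane (notably the leading lane) now holds the maximum
```

After log2(32) = 5 steps, all 32 lanes agree on the result, which is why only the warp’s leading thread needs to issue the atomic update.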
Wrap Up the Kernel#
Finally, the finished kernel must be wrapped in a GpuKernel object, to be handed to the JIT
compiler and runtime system.
We’re using the KernelFactory and GpuIndexing classes from the codegen module for this.
We define a factory for the kernel’s launch configurations using ManualLaunchConfiguration,
and set up an instance of the CupyJit just-in-time compiler.
from pystencils.codegen.driver import KernelFactory
from pystencils.codegen.gpu_indexing import GpuIndexing, ManualLaunchConfiguration
from pystencils.jit.gpu_cupy import CupyJit
def launch_config_factory() -> ManualLaunchConfiguration:
return ManualLaunchConfiguration(GpuIndexing.get_hardware_properties(target))
kfactory = KernelFactory(ctx)
kernel = kfactory.create_gpu_kernel(
platform,
kernel_ast,
"kernel",
ps.Target.CUDA,
CupyJit(),
launch_config_factory
)
ps.inspect(kernel, show_cpp=True)
__global__ void kernel (double * RESTRICT const _data_f, const int64_t _size_f_0, const int64_t _size_f_1, const int64_t _size_f_2, int64_t _stride_f_0, int64_t _stride_f_1, int64_t _stride_f_2, const double c, double * wmax)
{
double wmax_local = PS_FP64_NEG_INFINITY;
int64_t ctr_2 = (int64_t) blockIdx.y;
int64_t ctr_1 = (int64_t) threadIdx.y;
int64_t ctr_0 = (int64_t) blockIdx.x * (int64_t) blockDim.x + (int64_t) threadIdx.x;
if(ctr_2 < _size_f_2 && ctr_1 < _size_f_1 && ctr_0 < _size_f_0)
{
const double x = _data_f[ctr_0 * _stride_f_0 + ctr_1 * _stride_f_1 + ctr_2 * _stride_f_2];
_data_f[ctr_0 * _stride_f_0 + ctr_1 * _stride_f_1 + ctr_2 * _stride_f_2] = c * x;
wmax_local = fmax(wmax_local, x);
}
{
wmax_local = max(wmax_local, __shfl_xor_sync(0xffffffff, wmax_local, 16));
wmax_local = max(wmax_local, __shfl_xor_sync(0xffffffff, wmax_local, 8));
wmax_local = max(wmax_local, __shfl_xor_sync(0xffffffff, wmax_local, 4));
wmax_local = max(wmax_local, __shfl_xor_sync(0xffffffff, wmax_local, 2));
wmax_local = max(wmax_local, __shfl_xor_sync(0xffffffff, wmax_local, 1));
if(ctr_2 < _size_f_2 - 0LL && ctr_1 < _size_f_1 - 0LL && ctr_0 < _size_f_0 - 0LL && (threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * (blockDim.x * blockDim.y)) % 32 == 0)
{
pystencils::runtime::gpu::atomicMax(wmax, wmax_local);
}
}
}
Note on Launch Configurations#
A GPU kernel’s launch configuration can be pre-determined wholly or partly by the code generator,
or kept fully in the hands of the runtime system. These different types of launch configurations
are implemented by a set of classes in the pystencils.codegen.gpu_indexing module.
Since a concrete launch configuration is specific not to the kernel itself, but to the kernel’s
invocation site, the code generator only attaches a factory function for launch configurations
to GpuKernel. It is up to the runtime system to locally instantiate and configure a launch configuration.
To determine the actual launch grid, the launch configuration must be evaluated at the kernel’s call site
by passing the required parameters to GpuLaunchConfiguration.evaluate.
The CupyJit, for instance, will create the launch configuration object while preparing the JIT-compiled
kernel wrapper object. There, the launch config is exposed to the user, who may modify some of its properties.
These depend on the type of the launch configuration:
the AutomaticLaunchConfiguration permits no modification and computes grid and block size directly from kernel
parameters,
while the ManualLaunchConfiguration requires the user to manually specify both grid and block size.
The DynamicBlockSizeLaunchConfiguration dynamically computes the grid size from either the default block size
or a computed block size. Computing block sizes can be requested by the user via the trim_block_size or
fit_block_size member functions. These functions receive an initial block size as an argument and adapt it.
The trim_block_size function trims the initial block size with the sizes of the iteration space, i.e. it takes
the per-dimension minimum of the two. The fit_block_size function performs a block-fitting algorithm that
incrementally enlarges the trimmed block size until it is large enough
and aligned with the warp size.
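As a rough illustration of these two strategies, consider the following sketch (a simplification under stated assumptions; the actual pystencils algorithms account for hardware limits and distribute growth across dimensions more carefully):

```python
from math import prod

def trim_block_size(block_size, ispace_size):
    """Per-dimension minimum of the initial block size and the iteration space."""
    return tuple(min(b, s) for b, s in zip(block_size, ispace_size))

def fit_block_size(block_size, ispace_size, warp_size=32):
    """Trim, then enlarge the leading dimension until the total thread
    count is a multiple of the warp size (simplified sketch)."""
    fitted = list(trim_block_size(block_size, ispace_size))
    while prod(fitted) % warp_size != 0:
        fitted[0] += 1
    return tuple(fitted)
```

For a default block size of (32, 8, 4) and an iteration space of (100, 5, 2), trimming yields (32, 5, 2); since its 320 threads are already a multiple of the warp size, fitting leaves it unchanged.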
The evaluate method can only be used from within a Python runtime environment.
When exporting pystencils CUDA kernels for external use in C++ projects,
equivalent C++ code evaluating the launch config must be generated.
This is the task of, e.g., pystencils-sfg.