Pystencils for GPUs#
Pystencils offers code generation for Nvidia and AMD GPUs using the CUDA and HIP programming models, as well as just-in-time compilation and execution of CUDA kernels from within Python, based on the CuPy library. This section’s objective is to give a detailed introduction to the creation of GPU kernels with pystencils.
Note
CuPy is a Python library for numerical computations on GPU arrays, which operates much in the same way that NumPy works on CPU arrays. CuPy and NumPy expose nearly the same APIs for array operations; the difference is that CuPy allocates all its arrays on the GPU and performs its operations as CUDA kernels. CuPy also exposes a just-in-time compiler for GPU kernels. In pystencils, we use CuPy both to compile and provide executable kernels on demand from within Python code, and to allocate and manage the data these kernels can be executed on.
For more information on CuPy, refer to their documentation.
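As a minimal sketch of this NumPy-like workflow (the array shape and values are arbitrary illustrations):
import cupy as cp
x = cp.zeros((8, 8), dtype=cp.float64)  # allocated directly in GPU memory
x += 1.0                                # executed as a CUDA kernel on the device
x_host = cp.asnumpy(x)                  # explicit copy back to a host-side NumPy array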
Generate, Compile and Run GPU Kernels#
The CUDA and HIP platforms are made available in pystencils via the code generation targets Target.CUDA and Target.HIP. For pystencils code to be portable between both, we can use Target.CurrentGPU to automatically select one or the other, depending on the current runtime environment.
Note
If cupy is not installed, create_kernel will raise an exception when using Target.CurrentGPU. You can still generate kernels for CUDA or HIP directly even without CuPy; you just won’t be able to just-in-time compile and run them.
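For instance, the following sketch generates CUDA code without compiling it; only the compile step requires CuPy and a CUDA device (the fields and update rule here are placeholders):
import pystencils as ps
f, g = ps.fields("f, g: float64[3D]")
update = ps.Assignment(f.center(), 2 * g.center())
cfg = ps.CreateKernelConfig(target=ps.Target.CUDA)  # explicit target, no CuPy needed
kernel = ps.create_kernel(update, cfg)              # kernel.compile() would require CuPy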
Here is a snippet creating a kernel for the locally available GPU target:
import pystencils as ps

f, g = ps.fields("f, g: float64[3D]")
update = ps.Assignment(f.center(), 2 * g.center())
cfg = ps.CreateKernelConfig(target=ps.Target.CurrentGPU)
kernel = ps.create_kernel(update, cfg)
ps.inspect(kernel)
The kernel object returned by the code generator in the above snippet is an instance of the GpuKernel class. It extends Kernel with some GPU-specific information. If a GPU is available and CuPy is installed in the current environment, the kernel can be compiled and run immediately. To execute the kernel, a cupy.ndarray has to be passed for each field:
import cupy as cp
rng = cp.random.default_rng(seed=42)
f_arr = rng.random((16, 16, 16))
g_arr = cp.zeros_like(f_arr)
kfunc = kernel.compile()
kfunc(f=f_arr, g=g_arr)
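Since the results reside in GPU memory, they have to be copied back explicitly before any host-side processing; a minimal sketch using CuPy's standard transfer function:
import numpy as np
f_host = cp.asnumpy(f_arr)  # copy the device array back to a NumPy array on the host
assert isinstance(f_host, np.ndarray)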
Modify the Indexing Scheme and Launch Configuration#
There are two key elements to how the work items of a GPU kernel’s iteration space are mapped onto a GPU launch grid:
- The indexing scheme defines the relation between thread indices and iteration space points; it can be modified through the gpu.indexing_scheme option and is fixed for the entire kernel.
- The launch configuration defines the number of threads per block and the number of blocks on the grid with which the kernel is launched. It mostly depends on the sizes of the arrays passed to the kernel, but parts of it may also be modified. The launch configuration may change at each kernel invocation.
The Default “Linear3D” Indexing Scheme#
By default, pystencils will employ a 1:1-mapping between threads and iteration space points via the global thread indices inside the launch grid; e.g.
ctr_0 = start_0 + step_0 * (blockDim.x * blockIdx.x + threadIdx.x);
ctr_1 = start_1 + step_1 * (blockDim.y * blockIdx.y + threadIdx.y);
ctr_2 = start_2 + step_2 * (blockDim.z * blockIdx.z + threadIdx.z);
For most kernels with an at most three-dimensional iteration space, this behavior is sufficient and desired. It can be enforced by setting gpu.indexing_scheme = "Linear3D".
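For example, the default scheme can also be requested explicitly on a configuration object like the one used above:
cfg = ps.CreateKernelConfig(target=ps.Target.CurrentGPU)
cfg.gpu.indexing_scheme = "Linear3D"  # explicitly select the default 1:1 mapping
kernel = ps.create_kernel(update, cfg)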
The GPU thread block size of a compiled kernel's wrapper object can only be directly modified for manual launch configurations; see the section on manual launch grids below. Linear indexing schemes without a manual launch configuration either employ default block sizes or compute block sizes from a user-provided initial block size, using one of two exposed member functions: fit_block_size and trim_block_size. The trim_block_size function trims each dimension of the user-defined initial block size to the extent of the iteration space. The fit_block_size function employs a fitting algorithm that finds a suitable block configuration which is divisible by the hardware's warp size.
kfunc = kernel.compile()
# three different configuration cases for block size:
# a) Nothing
# -> use default block size, perform no fitting, no trimming
# b) Activate fitting with initial block size
kfunc.launch_config.fit_block_size((8, 8, 4))
# c) Activate trimming with initial block size
kfunc.launch_config.trim_block_size((8, 8, 4))
# finally run the kernel...
kfunc(f=f_arr, g=g_arr)
Block sizes that align with multiples of the hardware's warp size allow for better usage of the GPU's resources. Such block size configurations are not enforced, but notifying the code generator via the GpuOptions.assume_warp_aligned_block_size option that the configured block size is divisible by the warp size unlocks further optimization potential, e.g. for warp-level reductions. When setting this option to True, the user has to make sure that this alignment actually holds. For manual launch configurations, this can be achieved by providing suitable block sizes via launch_config.block_size. For the other launch configurations, the criterion is guaranteed when pystencils' default block sizes are used. Using fit_block_size also guarantees it, while producing block sizes that are better tailored to the kernel's iteration space. For trim_block_size, the trimmed block dimension that is closest to the next multiple of the warp size is rounded up to that multiple. In all cases, the final block size is checked against the imposed hardware limits, and an error is raised if they are exceeded.
In any case, pystencils will automatically compute the grid size from the shapes of the kernel's array arguments and the final thread block size.
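As a sketch, combining a manual, warp-aligned block size with this option might look as follows (reusing the update rule and arrays from above, and assuming a warp size of 32):
cfg = ps.CreateKernelConfig(target=ps.Target.CurrentGPU)
cfg.gpu.manual_launch_grid = True
cfg.gpu.assume_warp_aligned_block_size = True  # we promise warp-divisible block sizes
kfunc = ps.create_kernel(update, cfg).compile()
kfunc.launch_config.block_size = (64, 2, 1)  # 64 * 2 * 1 = 128 threads, divisible by 32
kfunc.launch_config.grid_size = (4, 2, 1)
kfunc(f=f_arr, g=g_arr)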
Attention
According to the way GPU architectures split thread blocks into warps, pystencils will map the kernel's fastest spatial coordinate onto the x block and thread indices, the second-fastest to y, and the slowest coordinate to z.
This can mean that, when using cupy arrays with the default memory layout (corresponding to the "numpy" field layout specifier), the thread coordinates and the spatial coordinates map to each other in opposite order; e.g.

| Spatial Coordinate | Thread Index |
|---|---|
| x (slowest) | threadIdx.z |
| y | threadIdx.y |
| z (fastest) | threadIdx.x |
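A small illustration of this layout effect, assuming a default (C-contiguous) CuPy array:
import cupy as cp
arr = cp.zeros((4, 8, 16))     # "numpy" layout: the last index varies fastest
print(arr.flags.c_contiguous)  # True
# for arr[i, j, k]: k maps to the x thread index, j to y, and i (slowest) to z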
Manual Launch Grids and Non-Cuboid Iteration Patterns#
By default, the above indexing schemes will automatically compute the GPU launch configuration from array shapes and optional user input. However, it is also possible to override this behavior and specify a launch grid manually. This is even unavoidable when the code generator cannot precompute the number of points in the kernel's iteration space.
To specify a manual launch configuration, set the gpu.manual_launch_grid option to True. Then, after compiling the kernel, set its block and grid size via the launch_config property:
cfg.gpu.manual_launch_grid = True
kernel = ps.create_kernel(update, cfg)
kfunc = kernel.compile()
kfunc.launch_config.block_size = (64, 2, 1)
kfunc.launch_config.grid_size = (4, 2, 1)
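With this configuration, the kernel is launched with 64 × 2 × 1 = 128 threads per block on a 4 × 2 × 1 grid, covering at most 256 × 4 × 1 iteration space points.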
An example where this is necessary is the triangular iteration previously described in the Kernel Creation Guide. Let’s set it up once more:
# hypothetical stand-in for the triangular example's assignments from the guide
u, v = ps.fields("u, v: float64[2D]")
assignments = ps.Assignment(u.center(), 2 * v.center())
y = ps.DEFAULTS.spatial_counters[0]
cfg = ps.CreateKernelConfig()
cfg.target = ps.Target.CurrentGPU
cfg.iteration_slice = ps.make_slice[:, y:]
For this kernel, the code generator cannot figure out a launch configuration on its own, so we need to manually provide one:
cfg.gpu.manual_launch_grid = True
kernel = ps.create_kernel(assignments, cfg).compile()
kernel.launch_config.block_size = (8, 8)
kernel.launch_config.grid_size = (2, 2)
This way the kernel will cover this iteration space:
[Figure: points of the triangular iteration space covered by the manual launch grid]
We can also observe the effect of decreasing the launch grid size.
kernel.launch_config.block_size = (4, 4)
kernel.launch_config.grid_size = (2, 3)
[Figure: partial coverage of the triangular iteration space with the smaller launch grid]
Here, since there are only eight threads operating in the \(x\)-direction (block size 4 × grid size 2) and twelve threads in the \(y\)-direction (4 × 3), only a part of the triangle is processed.
Developers To Do:
- Fast approximation functions
- Fp16 on GPU