Pystencils for SYCL#

Pystencils offers code generation for SYCL targets via Target.SYCL. The pystencils JIT uses dpctl to manage devices, queues, and USM allocations. For more information on dpctl, refer to its documentation. It can be installed via pip into our virtual environment:

pip install dpctl

Note

It is possible to target Nvidia and AMD GPUs via Intel oneAPI. From version 2025.3 onward, the corresponding plugin needs to be built from source (see the oneAPI docs). For previous versions, pre-built plugins were available from Codeplay.

Setup#

To list all available devices, you can run:

import dpctl
dpctl.get_devices()
[<dpctl.SyclDevice [backend_type.cuda, device_type.gpu,  Quadro P2000] at 0x7248406130f0>,
 <dpctl.SyclDevice [backend_type.opencl, device_type.cpu,  Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz] at 0x724841a8ab30>]

To launch a kernel on a device, a dpctl.SyclQueue is needed:

q = dpctl.SyclQueue()
q.get_sycl_device()
<dpctl.SyclDevice [backend_type.opencl, device_type.cpu,  Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz] at 0x7247feb8f3f0>
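A default-constructed queue picks whatever device the SYCL runtime considers the default, which in the listing above is the OpenCL CPU rather than the CUDA GPU. dpctl also accepts a filter selector string of the form backend:device_type:relative_id, where components may be omitted. The helper below is a hypothetical convenience function (it is not part of dpctl or pystencils) that only assembles such a string. The dpctl call itself is left in a comment, since it requires a matching device to be present.

```python
def filter_selector(backend=None, device_type=None, relative_id=None):
    """Assemble a SYCL filter selector string such as 'cuda:gpu:0'.

    Hypothetical helper for illustration; omitted components are skipped.
    """
    parts = [backend, device_type, relative_id]
    return ":".join(str(p) for p in parts if p is not None)


gpu_filter = filter_selector("cuda", "gpu")
cpu_filter = filter_selector("opencl", "cpu", 0)
print(gpu_filter)
print(cpu_filter)

# With dpctl installed and a matching device available, the string can be
# passed to the queue constructor, for example:
#   import dpctl
#   q = dpctl.SyclQueue(gpu_filter)
```

Whether a given filter matches anything depends on the installed backends, so it is worth checking the output of dpctl.get_devices() first.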

Tip

If dpctl cannot find any devices, make sure that the intel-cmplr-lib-rt, intel-cmplr-lib-ur, intel-cmplr-lic-rt, and intel-sycl-rt packages are installed in versions that match the oneAPI installation. Sometimes it also helps to add the lib folder of our environment to LD_LIBRARY_PATH (see the noxfile.py for reference).

Next, we create some arrays as dpctl.tensor.usm_ndarray allocations using dpctl.tensor:

import numpy as np
import dpctl.tensor as dpt
f_arr = dpt.ones((10, 10), dtype=np.float32, sycl_queue=q)
g_arr = dpt.zeros((10, 10), dtype=np.float32, sycl_queue=q)
f_arr
usm_ndarray([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)

Executing range kernels#

SYCL kernels are launched within a parallel_for. To create a sycl::range-based kernel, use Target.SYCL in CreateKernelConfig. A simple SYCL kernel can be executed like this:

import pystencils as ps

dtype = "float32"
f, g = ps.fields(f"f, g: {dtype}[2D]")
asm = ps.Assignment(g.center(), 2.0 * f.center())
config = ps.CreateKernelConfig(target=ps.Target.SYCL)
ker = ps.create_kernel(asm, config=config)
kfunc = ker.compile()
kfunc(f=f_arr, g=g_arr, queue=q)
q.wait()
g_arr
usm_ndarray([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
             [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]], dtype=float32)
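To sanity-check a kernel result, it can help to compare against a plain NumPy reference on the host. The sketch below uses host arrays as stand-ins so that it runs without a SYCL device; check_scaled_copy is a hypothetical helper name. With dpctl available, the device arrays could be copied to the host first, for example with dpt.asnumpy(f_arr) and dpt.asnumpy(g_arr).

```python
import numpy as np


def check_scaled_copy(f_host, g_host, factor=2.0):
    """Return True if g_host equals factor * f_host, as the kernel above computes."""
    np.testing.assert_allclose(g_host, factor * f_host)
    return True


# Host stand-ins for the arrays used above. In a real session these would be
# obtained from the device arrays, e.g. f_host = dpt.asnumpy(f_arr).
f_host = np.ones((10, 10), dtype=np.float32)
g_host = np.full((10, 10), 2.0, dtype=np.float32)

ok = check_scaled_copy(f_host, g_host)
```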

Executing ND-range kernels#

It is also possible to create ND-range kernels. To create a kernel launched with sycl::nd_range, SyclOptions.automatic_block_size needs to be set to False.

config = ps.CreateKernelConfig(target=ps.Target.SYCL)
config.sycl.automatic_block_size = False
config.gpu.manual_launch_grid = True
ker = ps.create_kernel(asm, config=config)
kfunc = ker.compile()

The indexing scheme and the launch configuration work in the same way as for the GPU targets:

kfunc.launch_config.block_size = (2, 2)
kfunc.launch_config.grid_size = (2, 3)
kfunc(f=g_arr, g=f_arr, queue=q)
q.wait()
f_arr
usm_ndarray([[4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
             [4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
             [4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
             [4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
             [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
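The output above shows that only part of the array was updated: a block size of (2, 2) with a grid size of (2, 3) spans 2 * 2 = 4 work-items along the first axis and 2 * 3 = 6 along the second. The following NumPy-only sketch emulates that coverage on the host. It assumes that launch dimension i maps to array axis i and that work-items outside the array bounds do nothing, which is consistent with the result shown but is an assumption about the generated kernel. The helper name covered_extent is hypothetical.

```python
import numpy as np


def covered_extent(block_size, grid_size, shape):
    """Number of cells per axis touched by a block_size x grid_size launch."""
    return tuple(min(b * g, n) for b, g, n in zip(block_size, grid_size, shape))


# Host stand-ins for the state before the ND-range launch:
# f holds ones, g holds the doubled values from the range kernel.
f_host = np.ones((10, 10), dtype=np.float32)
g_host = np.full((10, 10), 2.0, dtype=np.float32)

nx, ny = covered_extent((2, 2), (2, 3), f_host.shape)

# The launch above swaps the roles of the arrays: it reads g and writes f.
f_host[:nx, :ny] = 2.0 * g_host[:nx, :ny]
```

To update the full 10 x 10 array with the same block size, the grid size would need to be at least (5, 5).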

Note on targeting AMD GPUs#

Warning

Targeting AMD GPUs via dpctl is currently experimental and not tested thoroughly.

The SYCL compiler uses the -fsycl-targets flag to add support for different targets in the produced binary. The spirv64 target is for CPUs and Intel GPUs, nvptx64-nvidia-cuda is for Nvidia GPUs, and amdgcn-amd-amdhsa is for AMD GPUs. However, for AMD GPUs the exact architecture needs to be specified (for more info see here). jit.sycl.SYCLClangInfo tries to determine the correct flags for each device that it can find via the SYCL runtime. If the detection of an AMD GPU does not work, jit.sycl.SYCLClangInfo.amd_offload_architecutres allows the correct architecture name to be specified manually.
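To make the difference between the targets concrete, the sketch below assembles the compiler flags for each of the three target triples named above. It is a hypothetical illustration, not the logic used by jit.sycl.SYCLClangInfo. The -Xsycl-target-backend and --offload-arch options follow the usual DPC++ command-line convention for AMD targets, and gfx90a is only an example architecture name.

```python
def sycl_target_flags(target, amd_arch=None):
    """Build an illustrative list of DPC++ flags for one SYCL target triple.

    Hypothetical helper: the real flags are determined by pystencils itself.
    """
    flags = ["-fsycl", f"-fsycl-targets={target}"]
    if target == "amdgcn-amd-amdhsa":
        # AMD targets need the exact GPU architecture, e.g. "gfx90a".
        if amd_arch is None:
            raise ValueError("an AMD offload architecture must be given")
        flags += ["-Xsycl-target-backend", f"--offload-arch={amd_arch}"]
    return flags


intel_flags = sycl_target_flags("spirv64")
nvidia_flags = sycl_target_flags("nvptx64-nvidia-cuda")
amd_flags = sycl_target_flags("amdgcn-amd-amdhsa", amd_arch="gfx90a")
print(" ".join(amd_flags))
```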