Pystencils for SYCL#
Pystencils offers code generation for SYCL through the Target.SYCL target.
The Pystencils JIT uses dpctl to manage devices, queues, and USM allocations.
For more information on dpctl, refer to its documentation.
It can be installed into the virtual environment via pip:
pip install dpctl
Note
It is possible to target Nvidia and AMD GPUs via Intel oneAPI. From version 2025.3 onward, the corresponding plugin needs to be built from source (see the oneAPI docs). For previous versions, pre-built plugins from Codeplay were available.
Setup#
To list all available devices, run:
import dpctl
dpctl.get_devices()
[<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, Quadro P2000] at 0x7248406130f0>,
<dpctl.SyclDevice [backend_type.opencl, device_type.cpu, Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz] at 0x724841a8ab30>]
To launch a kernel on a device, a dpctl.SyclQueue is needed:
q = dpctl.SyclQueue()
q.get_sycl_device()
<dpctl.SyclDevice [backend_type.opencl, device_type.cpu, Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz] at 0x7247feb8f3f0>
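The default queue above ended up on the CPU even though a CUDA GPU was listed. A minimal sketch of how a specific device could be requested instead, assuming dpctl's filter selector strings (such as "gpu" or "opencl:cpu") and a fallback to the default device. The helper name make_queue is hypothetical and not part of pystencils or dpctl.

```python
def make_queue(filter_string="gpu"):
    """Return a queue for the requested device, or the default queue as fallback.

    Hypothetical helper for illustration; it only wraps dpctl.SyclQueue.
    """
    # Imported lazily so that the helper can be defined without dpctl installed.
    import dpctl

    try:
        # A filter selector string narrows the choice by backend and/or device type.
        return dpctl.SyclQueue(filter_string)
    except dpctl.SyclQueueCreationError:
        # No matching device was found: fall back to the default-selected device.
        return dpctl.SyclQueue()
```

Calling `make_queue("gpu").get_sycl_device()` would then show which device was actually picked on the machine at hand.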
Tip
If dpctl cannot find any devices, make sure that the intel-cmplr-lib-rt, intel-cmplr-lib-ur, intel-cmplr-lic-rt, and intel-sycl-rt packages are installed in the version that corresponds to the oneAPI installation.
Sometimes it also helps to add the lib folder of the environment to LD_LIBRARY_PATH (see
the noxfile.py for reference).
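To check the packages named in the tip, a small sketch using only the standard library might look as follows. It merely reports the installed versions. Comparing them against the oneAPI installation is left to the reader, and the helper name installed_versions is an assumption made for illustration.

```python
from importlib import metadata

# Package names taken from the tip above.
RUNTIME_PACKAGES = (
    "intel-cmplr-lib-rt",
    "intel-cmplr-lib-ur",
    "intel-cmplr-lic-rt",
    "intel-sycl-rt",
)


def installed_versions(packages=RUNTIME_PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

All four packages should report the same release as the oneAPI installation. A missing or mismatched entry is a likely reason for an empty device list.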
Next, create some arrays as dpctl.tensor.usm_ndarray allocations with dpctl.tensor:
import numpy as np
import dpctl.tensor as dpt
f_arr = dpt.ones((10, 10), dtype=np.float32, sycl_queue=q)
g_arr = dpt.zeros((10, 10), dtype=np.float32, sycl_queue=q)
f_arr
usm_ndarray([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
Executing range kernels#
SYCL kernels are launched within a parallel_for.
To create a sycl::range-based kernel, use Target.SYCL in CreateKernelConfig.
A simple SYCL kernel can then be executed like this:
import pystencils as ps
dtype = "float32"
f, g = ps.fields(f"f, g: {dtype}[2D]")
asm = ps.Assignment(g.center(), 2.0 * f.center())
config = ps.CreateKernelConfig(target=ps.Target.SYCL)
ker = ps.create_kernel(asm, config=config)
kfunc = ker.compile()
kfunc(f=f_arr, g=g_arr, queue=q)
q.wait()
g_arr
usm_ndarray([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]], dtype=float32)
Executing ND-range kernels#
It is also possible to create ND-range kernels.
To create a kernel launched with sycl::nd_range, SyclOptions.automatic_block_size
needs to be set to False:
config = ps.CreateKernelConfig(target=ps.Target.SYCL)
config.sycl.automatic_block_size = False
config.gpu.manual_launch_grid = True
ker = ps.create_kernel(asm, config=config)
kfunc = ker.compile()
The indexing scheme and the launch configuration work in the same way as for the GPU targets:
kfunc.launch_config.block_size = (2, 2)
kfunc.launch_config.grid_size = (2, 3)
kfunc(f=g_arr, g=f_arr, queue=q)
q.wait()
f_arr
usm_ndarray([[4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
[4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
[4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
[4., 4., 4., 4., 4., 4., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
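The printed result suggests that the launch covers block_size[i] * grid_size[i] cells along each array axis, so only a 4 x 6 corner of the 10 x 10 array is updated. The following pure-Python sketch emulates that coverage for the kernel g = 2 * f. It is not the generated SYCL code, and the axis mapping is an assumption read off the output above.

```python
def covered_extent(block_size, grid_size):
    """Number of cells covered per axis: block size times grid size."""
    return tuple(b * g for b, g in zip(block_size, grid_size))


def emulate_nd_range(src, dst, block_size, grid_size):
    """Apply dst = 2 * src on the covered corner of a 2D list of lists."""
    rows, cols = covered_extent(block_size, grid_size)
    # Clamping to the array shape only keeps this sketch safe; it makes no
    # claim about how the generated kernel treats out-of-bounds work-items.
    for i in range(min(rows, len(dst))):
        for j in range(min(cols, len(dst[i]))):
            dst[i][j] = 2.0 * src[i][j]
    return dst


src = [[2.0] * 10 for _ in range(10)]  # plays the role of g_arr (all 2.0)
dst = [[1.0] * 10 for _ in range(10)]  # plays the role of f_arr (all 1.0)
emulate_nd_range(src, dst, block_size=(2, 2), grid_size=(2, 3))
print(covered_extent((2, 2), (2, 3)))  # (4, 6)
```

To update the whole 10 x 10 array under the same assumption, the grid would need at least 5 blocks per axis for a block size of (2, 2).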
Note on targeting AMD GPUs#
Warning
Targeting AMD GPUs via dpctl is currently experimental and not tested thoroughly.
The SYCL compiler uses the -fsycl-targets flag to add support for different targets to the produced
binary.
The spirv64 target is for CPUs and Intel GPUs, nvptx64-nvidia-cuda is for Nvidia GPUs, and amdgcn-amd-amdhsa is for AMD GPUs.
However, for AMD GPUs the exact architecture needs to be specified (for more info see here).
jit.sycl.SYCLClangInfo tries to determine the correct flags for each device that it can find
via the SYCL runtime.
If the detection of an AMD GPU does not work, jit.sycl.SYCLClangInfo.amd_offload_architecutres allows
specifying the correct architecture name manually.
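To illustrate the kind of flags involved, here is a sketch of how a compiler command line could be assembled from a list of targets. The helper sycl_target_flags is hypothetical and does not reflect how jit.sycl.SYCLClangInfo is implemented. The flag spelling follows common oneAPI DPC++ usage and should be treated as an assumption, as should the example architecture name gfx90a.

```python
AMD_TARGET = "amdgcn-amd-amdhsa"


def sycl_target_flags(targets, amd_arch=None):
    """Build compiler flags for the given SYCL target triples (illustrative only)."""
    flags = ["-fsycl", "-fsycl-targets=" + ",".join(targets)]
    if AMD_TARGET in targets:
        if amd_arch is None:
            # AMD GPUs need the exact architecture, which cannot be guessed here.
            raise ValueError("AMD targets need an explicit architecture, e.g. 'gfx90a'")
        # Pass the architecture only to the AMD backend of the compiler.
        flags += [f"-Xsycl-target-backend={AMD_TARGET}", f"--offload-arch={amd_arch}"]
    return flags


print(sycl_target_flags(["spirv64", AMD_TARGET], amd_arch="gfx90a"))
```

The explicit error for a missing architecture mirrors the point made above: for AMD GPUs the target triple alone is not enough.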