Creating and calling kernels from Python¶
Creating kernels¶

create_kernel
(assignments, target='cpu', data_type='double', iteration_slice=None, ghost_layers=None, skip_independence_check=False, cpu_openmp=False, cpu_vectorize_info=None, cpu_blocking=None, omp_single_loop=True, gpu_indexing='block', gpu_indexing_params=mappingproxy({}), use_textures_for_interpolation=True, cpu_prepend_optimizations=[], use_auto_for_assignments=False, opencl_queue=None, opencl_ctx=None)¶ Creates abstract syntax tree (AST) of kernel, using a list of update equations.
 Parameters
assignments – can be a single assignment, sequence of assignments or an
AssignmentCollection
target – ‘cpu’, ‘llvm’, ‘gpu’ or ‘opencl’
data_type – data type used for all untyped symbols (i.e. nonfields), can also be a dict from symbol name to type
iteration_slice – rectangular subset to iterate over, if not specified the complete nonghost layer part of the field is iterated over
ghost_layers – a single integer specifies the ghost layer count at all borders, can also be a sequence of pairs
[(x_lower_gl, x_upper_gl), .... ]
. These layers are excluded from the iteration. If left to default, the number of ghost layers is determined automatically.skip_independence_check – don’t check that loop iterations are independent. This is needed e.g. for periodicity kernel, that access the field outside the iteration bounds. Use with care!
cpu_openmp – True or number of threads for OpenMP parallelization, False for no OpenMP
omp_single_loop – if OpenMP is active: whether multiple outer loops are permitted
cpu_vectorize_info – a dictionary with keys, ‘vector_instruction_set’, ‘assume_aligned’ and ‘nontemporal’ for documentation of these parameters see vectorize function. Example: ‘{‘instruction_set’: ‘avx512’, ‘assume_aligned’: True, ‘nontemporal’:True}’
cpu_blocking – a tuple of block sizes or None if no blocking should be applied
gpu_indexing – either ‘block’ or ‘line’ , or custom indexing class, see
AbstractIndexing
gpu_indexing_params – dict with indexing parameters (constructor parameters of indexing class) e.g. for ‘block’ one can specify ‘{‘block_size’: (20, 20, 10) }’
cpu_prepend_optimizations – list of extra optimizations to perform first on the AST
 Returns
abstract syntax tree (AST) object, that can either be printed as source code with
show_code
or can be compiled with through its ‘compile()’ member
Example
>>> import pystencils as ps >>> import numpy as np >>> s, d = ps.fields('s, d: [2D]') >>> assignment = ps.Assignment(d[0,0], s[0, 1] + s[0, 1] + s[1, 0] + s[1, 0]) >>> ast = ps.create_kernel(assignment, target='cpu', cpu_openmp=True) >>> kernel = ast.compile() >>> d_arr = np.zeros([5, 5]) >>> kernel(d=d_arr, s=np.ones([5, 5])) >>> d_arr array([[0., 0., 0., 0., 0.], [0., 4., 4., 4., 0.], [0., 4., 4., 4., 0.], [0., 4., 4., 4., 0.], [0., 0., 0., 0., 0.]])

create_indexed_kernel
(assignments, index_fields, target='cpu', data_type='double', coordinate_names='x', 'y', 'z', cpu_openmp=True, gpu_indexing='block', gpu_indexing_params=mappingproxy({}), use_textures_for_interpolation=True, opencl_queue=None, opencl_ctx=None)¶ Similar to
create_kernel()
, but here not all cells of a field are updated but only cells with coordinates which are stored in an index field. This traversal method can e.g. be used for boundary handling.The coordinates are stored in a separated index_field, which is a one dimensional array with struct data type. This struct has to contain fields named ‘x’, ‘y’ and for 3D fields (‘z’). These names are configurable with the ‘coordinate_names’ parameter. The struct can have also other fields that can be read and written in the kernel, for example boundary parameters.
index_fields: list of index fields, i.e. 1D fields with struct data type coordinate_names: name of the coordinate fields in the struct data type
Example
>>> import pystencils as ps >>> import numpy as np >>> >>> # Index field stores the indices of the cell to visit together with optional values >>> index_arr_dtype = np.dtype([('x', np.int32), ('y', np.int32), ('val', np.double)]) >>> index_arr = np.array([(1, 1, 0.1), (2, 2, 0.2), (3, 3, 0.3)], dtype=index_arr_dtype) >>> idx_field = ps.fields(idx=index_arr) >>> >>> # Additional values stored in index field can be accessed in the kernel as well >>> s, d = ps.fields('s, d: [2D]') >>> assignment = ps.Assignment(d[0,0], 2 * s[0, 1] + 2 * s[1, 0] + idx_field('val')) >>> ast = create_indexed_kernel(assignment, [idx_field], coordinate_names=('x', 'y')) >>> kernel = ast.compile() >>> d_arr = np.zeros([5, 5]) >>> kernel(s=np.ones([5, 5]), d=d_arr, idx=index_arr) >>> d_arr array([[0. , 0. , 0. , 0. , 0. ], [0. , 4.1, 0. , 0. , 0. ], [0. , 0. , 4.2, 0. , 0. ], [0. , 0. , 0. , 4.3, 0. ], [0. , 0. , 0. , 0. , 0. ]])

create_staggered_kernel
(assignments, target='cpu', gpu_exclusive_conditions=False, **kwargs)¶ Kernel that updates a staggered field.
For a staggered field, the first index coordinate defines the location of the staggered value. Further index coordinates can be used to store vectors/tensors at each point.
 Parameters
assignments – a sequence of assignments or an AssignmentCollection. Assignments to staggered field are processed specially, while subexpressions and assignments to regular fields are passed through to
create_kernel
. Multiple different staggered fields can be used, but they all need to use the same stencil (i.e. the same number of staggered points) and shape.target – ‘cpu’, ‘llvm’ or ‘gpu’
gpu_exclusive_conditions – disable the use of multiple conditionals inside the loop. The outer layers are then handled in an else branch.
kwargs – passed directly to create_kernel, iteration_slice and ghost_layers parameters are not allowed
 Returns
AST, see
create_kernel
GPU Indexing¶

class
AbstractIndexing
¶ Abstract base class for all Indexing classes. An Indexing class defines how a multidimensional field is mapped to CUDA’s block and grid system. It calculates indices based on CUDA’s thread and block indices and computes the number of blocks and threads a kernel is started with. The Indexing class is created with a pystencils field, a slice to iterate over, and further optional parameters that must have default values.

abstract property
coordinates
¶ Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic CUDA block and thread indices. These symbolic indices can be obtained with the method
index_variables

property
index_variables
¶ Sympy symbols for CUDA’s block and thread indices, and block and grid dimensions.

abstract
call_parameters
(arr_shape)¶ Determine grid and block size for kernel call.
 Parameters
arr_shape – the numeric (not symbolic) shape of the array
 Returns
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with

abstract
guard
(kernel_content, arr_shape)¶ In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
 Parameters
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
 Returns
ast node, which is put inside the kernel function

abstract
max_threads_per_block
()¶ Return maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape return None for unknown

abstract
symbolic_parameters
()¶ Set of symbols required in call_parameters code

abstract property

class
BlockIndexing
(field, iteration_slice, block_size=16, 16, 1, permute_block_size_dependent_on_layout=True, compile_time_block_size=False, maximum_block_size=1024, 1024, 64)¶ Generic indexing scheme that maps subblocks of an array to CUDA blocks.
 Parameters
field – pystencils field (common to all Indexing classes)
iteration_slice – slice that defines rectangular subarea which is iterated over
permute_block_size_dependent_on_layout – if True the block_size is permuted such that the fastest coordinate gets the largest amount of threads
compile_time_block_size – compile in concrete block size, otherwise the cuda variable ‘blockDim’ is used

property
coordinates
¶ Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic CUDA block and thread indices. These symbolic indices can be obtained with the method
index_variables

call_parameters
(arr_shape)¶ Determine grid and block size for kernel call.
 Parameters
arr_shape – the numeric (not symbolic) shape of the array
 Returns
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with

guard
(kernel_content, arr_shape)¶ In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
 Parameters
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
 Returns
ast node, which is put inside the kernel function

static
limit_block_size_by_register_restriction
(block_size, required_registers_per_thread, device=None)¶ Shrinks the block_size if there are too many registers used per multiprocessor. This is not done automatically, since the required_registers_per_thread are not known before compilation. They can be obtained by
func.num_regs
from a pycuda function. :returns smaller block_size if too many registers are used.

static
permute_block_size_according_to_layout
(block_size, layout)¶ Returns modified block_size such that the fastest coordinate gets the biggest block dimension

max_threads_per_block
()¶ Return maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape return None for unknown

symbolic_parameters
()¶ Set of symbols required in call_parameters code

class
LineIndexing
(field, iteration_slice)¶ Indexing scheme that assigns the innermost ‘line’ i.e. the elements which are adjacent in memory to a 1D CUDA block. The fastest coordinate is indexed with thread_idx.x, the remaining coordinates are mapped to block_idx.{x,y,z} This indexing scheme supports up to 4 spatial dimensions, where the innermost dimensions is not larger than the maximum amount of threads allowed in a CUDA block (which depends on device).

property
coordinates
¶ Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic CUDA block and thread indices. These symbolic indices can be obtained with the method
index_variables

call_parameters
(arr_shape)¶ Determine grid and block size for kernel call.
 Parameters
arr_shape – the numeric (not symbolic) shape of the array
 Returns
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with

guard
(kernel_content, arr_shape)¶ In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
 Parameters
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
 Returns
ast node, which is put inside the kernel function

max_threads_per_block
()¶ Return maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape return None for unknown

symbolic_parameters
()¶ Set of symbols required in call_parameters code

property