Creating and calling kernels from Python¶
Creating kernels¶
-
create_kernel
(assignments, target='cpu', data_type='double', iteration_slice=None, ghost_layers=None, skip_independence_check=False, cpu_openmp=False, cpu_vectorize_info=None, cpu_blocking=None, omp_single_loop=True, gpu_indexing='block', gpu_indexing_params=mappingproxy({}), use_textures_for_interpolation=True, cpu_prepend_optimizations=[], use_auto_for_assignments=False, opencl_queue=None, opencl_ctx=None)¶ Creates the abstract syntax tree (AST) of a kernel from a list of update equations.
- Parameters
assignments – can be a single assignment, sequence of assignments or an
AssignmentCollection
target – ‘cpu’, ‘llvm’, ‘gpu’ or ‘opencl’
data_type – data type used for all untyped symbols (i.e. non-fields), can also be a dict from symbol name to type
iteration_slice – rectangular subset to iterate over, if not specified the complete non-ghost layer part of the field is iterated over
ghost_layers – a single integer specifies the ghost layer count at all borders; can also be a sequence of pairs [(x_lower_gl, x_upper_gl), ...]. These layers are excluded from the iteration. If left at the default, the number of ghost layers is determined automatically.
skip_independence_check – don't check that loop iterations are independent. This is needed e.g. for periodicity kernels, which access the field outside the iteration bounds. Use with care!
cpu_openmp – True or number of threads for OpenMP parallelization, False for no OpenMP
omp_single_loop – if OpenMP is active: whether multiple outer loops are permitted
cpu_vectorize_info – a dictionary with the keys 'instruction_set', 'assume_aligned' and 'nontemporal'; for documentation of these parameters see the vectorize function. Example: {'instruction_set': 'avx512', 'assume_aligned': True, 'nontemporal': True}
cpu_blocking – a tuple of block sizes or None if no blocking should be applied
gpu_indexing – either 'block' or 'line', or a custom indexing class, see
AbstractIndexing
gpu_indexing_params – dict with indexing parameters (constructor parameters of the indexing class), e.g. for 'block' one can specify {'block_size': (20, 20, 10)}
cpu_prepend_optimizations – list of extra optimizations to perform first on the AST
- Returns
abstract syntax tree (AST) object that can either be printed as source code with
show_code
or compiled through its compile() member
Example
>>> import pystencils as ps
>>> import numpy as np
>>> s, d = ps.fields('s, d: [2D]')
>>> assignment = ps.Assignment(d[0,0], s[0, 1] + s[0, -1] + s[1, 0] + s[-1, 0])
>>> ast = ps.create_kernel(assignment, target='cpu', cpu_openmp=True)
>>> kernel = ast.compile()
>>> d_arr = np.zeros([5, 5])
>>> kernel(d=d_arr, s=np.ones([5, 5]))
>>> d_arr
array([[0., 0., 0., 0., 0.],
       [0., 4., 4., 4., 0.],
       [0., 4., 4., 4., 0.],
       [0., 4., 4., 4., 0.],
       [0., 0., 0., 0., 0.]])
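Building on the same fields and assignment, the CPU kernel can also be vectorized, or the GPU backend can be selected. The following lines are a sketch rather than a doctest: the chosen instruction set must be supported by the executing CPU, and the GPU variant additionally requires pycuda.
>>> # Vectorized CPU kernel; keys as documented for cpu_vectorize_info above
>>> vec_ast = ps.create_kernel(assignment, target='cpu',
...                            cpu_vectorize_info={'instruction_set': 'avx',
...                                                'assume_aligned': False,
...                                                'nontemporal': False})
>>> # GPU kernel; the fields are then passed as pycuda GPUArrays instead of numpy arrays
>>> gpu_ast = ps.create_kernel(assignment, target='gpu')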
-
create_indexed_kernel
(assignments, index_fields, target='cpu', data_type='double', coordinate_names=('x', 'y', 'z'), cpu_openmp=True, gpu_indexing='block', gpu_indexing_params=mappingproxy({}), use_textures_for_interpolation=True, opencl_queue=None, opencl_ctx=None)¶ Similar to
create_kernel()
, but here not all cells of a field are updated, only the cells whose coordinates are stored in an index field. This traversal method can e.g. be used for boundary handling. The coordinates are stored in a separate index_field, which is a one-dimensional array with a struct data type. This struct has to contain fields named 'x', 'y' and, for 3D fields, 'z'. These names are configurable with the 'coordinate_names' parameter. The struct can also have other fields that can be read and written in the kernel, for example boundary parameters.
- Parameters
index_fields – list of index fields, i.e. 1D fields with struct data type
coordinate_names – name of the coordinate fields in the struct data type
Example
>>> import pystencils as ps
>>> import numpy as np
>>>
>>> # Index field stores the indices of the cell to visit together with optional values
>>> index_arr_dtype = np.dtype([('x', np.int32), ('y', np.int32), ('val', np.double)])
>>> index_arr = np.array([(1, 1, 0.1), (2, 2, 0.2), (3, 3, 0.3)], dtype=index_arr_dtype)
>>> idx_field = ps.fields(idx=index_arr)
>>>
>>> # Additional values stored in index field can be accessed in the kernel as well
>>> s, d = ps.fields('s, d: [2D]')
>>> assignment = ps.Assignment(d[0,0], 2 * s[0, 1] + 2 * s[1, 0] + idx_field('val'))
>>> ast = create_indexed_kernel(assignment, [idx_field], coordinate_names=('x', 'y'))
>>> kernel = ast.compile()
>>> d_arr = np.zeros([5, 5])
>>> kernel(s=np.ones([5, 5]), d=d_arr, idx=index_arr)
>>> d_arr
array([[0. , 0. , 0. , 0. , 0. ],
       [0. , 4.1, 0. , 0. , 0. ],
       [0. , 0. , 4.2, 0. , 0. ],
       [0. , 0. , 0. , 4.3, 0. ],
       [0. , 0. , 0. , 0. , 0. ]])
-
create_staggered_kernel
(assignments, target='cpu', gpu_exclusive_conditions=False, **kwargs)¶ Kernel that updates a staggered field.
For a staggered field, the first index coordinate defines the location of the staggered value. Further index coordinates can be used to store vectors/tensors at each point.
- Parameters
assignments – a sequence of assignments or an AssignmentCollection. Assignments to staggered fields are processed specially, while subexpressions and assignments to regular fields are passed through to
create_kernel
. Multiple different staggered fields can be used, but they all need to use the same stencil (i.e. the same number of staggered points) and shape.
target – 'cpu', 'llvm' or 'gpu'
gpu_exclusive_conditions – disable the use of multiple conditionals inside the loop. The outer layers are then handled in an else branch.
kwargs – passed directly to create_kernel; the iteration_slice and ghost_layers parameters are not allowed
- Returns
AST, see
create_kernel
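For illustration, here is a minimal sketch of a staggered update that stores a simple two-point difference on the cell faces. It assumes that staggered fields are declared with FieldType.STAGGERED and accessed via staggered_access, as in recent pystencils releases; the exact API may differ between versions.
>>> import pystencils as ps
>>> c = ps.fields('c: [2D]')
>>> j = ps.fields('j(2): [2D]', field_type=ps.FieldType.STAGGERED)
>>> # One assignment per staggered direction of the 2D field (west and south faces)
>>> assignments = [ps.Assignment(j.staggered_access('W'), c[0, 0] - c[-1, 0]),
...                ps.Assignment(j.staggered_access('S'), c[0, 0] - c[0, -1])]
>>> kernel = ps.create_staggered_kernel(assignments, target='cpu').compile()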
GPU Indexing¶
-
class
AbstractIndexing
¶ Abstract base class for all Indexing classes. An Indexing class defines how a multidimensional field is mapped to CUDA’s block and grid system. It calculates indices based on CUDA’s thread and block indices and computes the number of blocks and threads a kernel is started with. The Indexing class is created with a pystencils field, a slice to iterate over, and further optional parameters that must have default values.
-
abstract property
coordinates
¶ Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic CUDA block and thread indices. These symbolic indices can be obtained with the method
index_variables
-
property
index_variables
¶ Sympy symbols for CUDA’s block and thread indices, and block and grid dimensions.
-
abstract
call_parameters
(arr_shape)¶ Determine grid and block size for kernel call.
- Parameters
arr_shape – the numeric (not symbolic) shape of the array
- Returns
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with
-
abstract
guard
(kernel_content, arr_shape)¶ In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
- Parameters
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
- Returns
ast node, which is put inside the kernel function
-
abstract
max_threads_per_block
()¶ Return the maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape, None is returned.
-
abstract
symbolic_parameters
()¶ Set of symbols required in call_parameters code
-
class
BlockIndexing
(field, iteration_slice, block_size=(16, 16, 1), permute_block_size_dependent_on_layout=True, compile_time_block_size=False, maximum_block_size=(1024, 1024, 64))¶ Generic indexing scheme that maps sub-blocks of an array to CUDA blocks.
- Parameters
field – pystencils field (common to all Indexing classes)
iteration_slice – slice that defines rectangular subarea which is iterated over
permute_block_size_dependent_on_layout – if True the block_size is permuted such that the fastest coordinate gets the largest number of threads
compile_time_block_size – compile the concrete block size into the kernel; otherwise the CUDA variable 'blockDim' is used
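BlockIndexing is normally not instantiated by hand; it is selected via the gpu_indexing and gpu_indexing_params arguments of create_kernel. A minimal sketch (assuming pycuda and a CUDA-capable device are available):
>>> import pystencils as ps
>>> s, d = ps.fields('s, d: [2D]')
>>> update = ps.Assignment(d[0, 0], s[0, 1] + s[0, -1] + s[1, 0] + s[-1, 0])
>>> ast = ps.create_kernel(update, target='gpu', gpu_indexing='block',
...                        gpu_indexing_params={'block_size': (32, 8, 1)})
>>> kernel = ast.compile()  # expects pycuda GPUArrays for the fields s and d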
-
property
coordinates
¶ Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic CUDA block and thread indices. These symbolic indices can be obtained with the method
index_variables
-
call_parameters
(arr_shape)¶ Determine grid and block size for kernel call.
- Parameters
arr_shape – the numeric (not symbolic) shape of the array
- Returns
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with
-
guard
(kernel_content, arr_shape)¶ In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
- Parameters
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
- Returns
ast node, which is put inside the kernel function
-
static
limit_block_size_by_register_restriction
(block_size, required_registers_per_thread, device=None)¶ Shrinks the block_size if too many registers are used per multiprocessor. This is not done automatically, since required_registers_per_thread is not known before compilation; it can be obtained from a pycuda function via
func.num_regs
.
- Returns
a smaller block_size if too many registers are used
-
static
permute_block_size_according_to_layout
(block_size, layout)¶ Returns modified block_size such that the fastest coordinate gets the biggest block dimension
-
max_threads_per_block
()¶ Return the maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape, None is returned.
-
symbolic_parameters
()¶ Set of symbols required in call_parameters code
-
class
LineIndexing
(field, iteration_slice)¶ Indexing scheme that assigns the innermost 'line', i.e. the elements that are adjacent in memory, to a 1D CUDA block. The fastest coordinate is indexed with thread_idx.x, the remaining coordinates are mapped to block_idx.{x,y,z}. This indexing scheme supports up to 4 spatial dimensions, where the innermost dimension must not be larger than the maximum number of threads allowed in a CUDA block (which depends on the device).
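LineIndexing is selected analogously by passing gpu_indexing='line' to create_kernel; a minimal sketch under the same assumptions as the BlockIndexing example above:
>>> import pystencils as ps
>>> s, d = ps.fields('s, d: [2D]')
>>> update = ps.Assignment(d[0, 0], s[0, 1] + s[0, -1] + s[1, 0] + s[-1, 0])
>>> ast = ps.create_kernel(update, target='gpu', gpu_indexing='line')
>>> kernel = ast.compile()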
-
property
coordinates
¶ Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic CUDA block and thread indices. These symbolic indices can be obtained with the method
index_variables
-
call_parameters
(arr_shape)¶ Determine grid and block size for kernel call.
- Parameters
arr_shape – the numeric (not symbolic) shape of the array
- Returns
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with
-
guard
(kernel_content, arr_shape)¶ In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
- Parameters
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
- Returns
ast node, which is put inside the kernel function
-
max_threads_per_block
()¶ Return the maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape, None is returned.
-
symbolic_parameters
()¶ Set of symbols required in call_parameters code