Creating and calling kernels from Python¶
Creating kernels¶
 create_kernel(assignments, *, config=None, **kwargs)¶
Creates abstract syntax tree (AST) of kernel, using a list of update equations. This function forms the general API and delegates the kernel creation to others depending on the CreateKernelConfig. :type assignments:
Union
[Assignment
,List
[Assignment
],AddAugmentedAssignment
,List
[AddAugmentedAssignment
],AssignmentCollection
,List
[Node
],NodeCollection
] :param assignments: can be a single assignment, sequence of assignments or anAssignmentCollection
:type config:Optional
[CreateKernelConfig
] :param config: CreateKernelConfig which includes the needed configuration :type kwargs: :param kwargs: Arguments for updating the config Returns:
abstract syntax tree (AST) object, that can either be printed as source code with
show_code
or can be compiled with through its ‘compile()’ member
Example
>>> import pystencils as ps >>> import numpy as np >>> s, d = ps.fields('s, d: [2D]') >>> assignment = ps.Assignment(d[0,0], s[0, 1] + s[0, 1] + s[1, 0] + s[1, 0]) >>> kernel_ast = ps.create_kernel(assignment, config=ps.CreateKernelConfig(cpu_openmp=True)) >>> kernel = kernel_ast.compile() >>> d_arr = np.zeros([5, 5]) >>> kernel(d=d_arr, s=np.ones([5, 5])) >>> d_arr array([[0., 0., 0., 0., 0.], [0., 4., 4., 4., 0.], [0., 4., 4., 4., 0.], [0., 4., 4., 4., 0.], [0., 0., 0., 0., 0.]])
 class CreateKernelConfig(target=Target.CPU, backend=None, function_name='kernel', data_type=<class 'numpy.float64'>, default_number_float=None, default_number_int=<class 'numpy.int64'>, iteration_slice=None, ghost_layers=None, cpu_openmp=False, cpu_vectorize_info=None, cpu_blocking=None, omp_single_loop=True, base_pointer_specification=None, gpu_indexing='block', gpu_indexing_params=<factory>, default_assignment_simplifications=False, cpu_prepend_optimizations=<factory>, use_auto_for_assignments=False, index_fields=None, coordinate_names=('x', 'y', 'z'), allow_double_writes=False, skip_independence_check=False)¶
Below all parameters for the CreateKernelConfig are explained

target:
Target
= 1¶ All targets are defined in
pystencils.enums.Target

backend:
Backend
= None¶ All backends are defined in
pystencils.enums.Backend

function_name:
str
= 'kernel'¶ Name of the generated function  only important if generated code is written out
 data_type¶
alias of
float64

default_number_float:
Union
[type
,str
,BasicType
] = None¶ Data type used for all untyped floating point numbers (i.e. 0.5). By default the value of data_type is used. If data_type is given as a defaultdict its default_factory is used.
 default_number_int¶
alias of
int64

iteration_slice:
Tuple
= None¶ Rectangular subset to iterate over, if not specified the complete nonghost layer part of the field is iterated over

ghost_layers:
Union
[bool
,int
,List
[Tuple
[int
]]] = None¶ A single integer specifies the ghost layer count at all borders, can also be a sequence of pairs
[(x_lower_gl, x_upper_gl), .... ]
. These layers are excluded from the iteration. If left to default, the number of ghost layers is determined automatically from the assignments.

cpu_openmp:
Union
[bool
,int
] = False¶ True or number of threads for OpenMP parallelization, False for no OpenMP. If set to True, the maximum number of available threads will be chosen.

cpu_vectorize_info:
Dict
= None¶ A dictionary with keys, ‘vector_instruction_set’, ‘assume_aligned’ and ‘nontemporal’ for documentation of these parameters see vectorize function. Example: ‘{‘instruction_set’: ‘avx512’, ‘assume_aligned’: True, ‘nontemporal’:True}’

base_pointer_specification:
Union
[List
[Iterable
[str
]],List
[Iterable
[int
]]] = None¶ Specification of how many and which intermediate pointers are created for a field access. For example [ (0), (2,3,)] creates on base pointer for coordinates 2 and 3 and writes the offset for coordinate zero directly in the field access. These specifications are defined dependent on the loop ordering. This function translates more readable version into the specification above.
For more information see: pystencils.transformations.create_intermediate_base_pointer

gpu_indexing:
str
= 'block'¶ Either ‘block’ or ‘line’ , or custom indexing class, see pystencils.gpu.AbstractIndexing

gpu_indexing_params:
MappingProxyType
¶ Dict with indexing parameters (constructor parameters of indexing class) e.g. for ‘block’ one can specify ‘{‘block_size’: (20, 20, 10) }’.

default_assignment_simplifications:
bool
= False¶ If True default simplifications are first performed on the Assignments. If problems occur during the simplification a warning will be thrown. Furthermore, it is essential to know that this is a twostage process. The first stage of the process acts on the level of the pystencils.AssignmentCollection. In this part, pystencil.simp.create_simplification_strategy from pystencils.simplificationfactory will be used to apply optimisations like insertion of constants to remove pressure from the registers. Thus the first part of the optimisations can only be executed if an AssignmentCollection is passed. The second part of the optimisation acts on the level of each Assignment individually. In this stage, all optimisations from sympy.codegen.rewriting.optims_c99 are applied to each Assignment. Thus this stage can also be applied if a list of Assignments is passed.

use_auto_for_assignments:
bool
= False¶ If set to True, auto can be used in the generated code for data types. This makes the type system more robust.

index_fields:
List
[Field
] = None¶ List of index fields, i.e. 1D fields with struct data type. If not None, create_index_kernel instead of create_domain_kernel is used.

coordinate_names:
Tuple
[str
,Any
] = ('x', 'y', 'z')¶ Name of the coordinate fields in the struct data type.

allow_double_writes:
bool
= False¶ If True, don’t check if every field is only written at a single location. This is required for example for kernels that are compiled with loop step sizes > 1, that handle multiple cells at once. Use with care!

skip_independence_check:
bool
= False¶ Don’t check that loop iterations are independent. This is needed e.g. for periodicity kernel, that access the field outside the iteration bounds. Use with care!
 class DataTypeFactory(dt)¶
Because of pickle, we need to have a nested class, instead of a lambda in __post_init__

target:
 create_domain_kernel(assignments, *, config)¶
Creates abstract syntax tree (AST) of kernel, using a NodeCollection.
Note that create_domain_kernel is a lower level function which shoul be accessed by not providing index_fields to create_kernel
 Parameters:
assignments (
NodeCollection
) – pystencils.node_collection.NodeCollection containing all assignements and nodes to be processedconfig (
CreateKernelConfig
) – CreateKernelConfig which includes the needed configuration
 Returns:
abstract syntax tree (AST) object, that can either be printed as source code with show_code or can be compiled with through its ‘compile()’ member
Example
>>> import pystencils as ps >>> import numpy as np >>> from pystencils.kernelcreation import create_domain_kernel >>> from pystencils.node_collection import NodeCollection >>> s, d = ps.fields('s, d: [2D]') >>> assignment = ps.Assignment(d[0,0], s[0, 1] + s[0, 1] + s[1, 0] + s[1, 0]) >>> kernel_config = ps.CreateKernelConfig(cpu_openmp=True) >>> kernel_ast = create_domain_kernel(NodeCollection([assignment]), config=kernel_config) >>> kernel = kernel_ast.compile() >>> d_arr = np.zeros([5, 5]) >>> kernel(d=d_arr, s=np.ones([5, 5])) >>> d_arr array([[0., 0., 0., 0., 0.], [0., 4., 4., 4., 0.], [0., 4., 4., 4., 0.], [0., 4., 4., 4., 0.], [0., 0., 0., 0., 0.]])
 create_indexed_kernel(assignments, *, config)¶
Similar to
create_kernel()
, but here not all cells of a field are updated but only cells with coordinates which are stored in an index field. This traversal method can e.g. be used for boundary handling.The coordinates are stored in a separated index_field, which is a one dimensional array with struct data type. This struct has to contain fields named ‘x’, ‘y’ and for 3D fields (‘z’). These names are configurable with the ‘coordinate_names’ parameter. The struct can have also other fields that can be read and written in the kernel, for example boundary parameters.
Note that create_indexed_kernel is a lower level function which shoul be accessed by providing index_fields to create_kernel
 Parameters:
assignments (
NodeCollection
) – pystencils.node_collection.NodeCollection containing all assignements and nodes to be processedconfig (
CreateKernelConfig
) – CreateKernelConfig which includes the needed configuration
 Returns:
abstract syntax tree (AST) object, that can either be printed as source code with show_code or can be compiled with through its ‘compile()’ member
Example
>>> import pystencils as ps >>> from pystencils.node_collection import NodeCollection >>> import numpy as np >>> from pystencils.kernelcreation import create_indexed_kernel >>> >>> # Index field stores the indices of the cell to visit together with optional values >>> index_arr_dtype = np.dtype([('x', np.int32), ('y', np.int32), ('val', np.double)]) >>> index_arr = np.array([(1, 1, 0.1), (2, 2, 0.2), (3, 3, 0.3)], dtype=index_arr_dtype) >>> idx_field = ps.fields(idx=index_arr) >>> >>> # Additional values stored in index field can be accessed in the kernel as well >>> s, d = ps.fields('s, d: [2D]') >>> assignment = ps.Assignment(d[0, 0], 2 * s[0, 1] + 2 * s[1, 0] + idx_field('val')) >>> kernel_config = ps.CreateKernelConfig(index_fields=[idx_field], coordinate_names=('x', 'y')) >>> kernel_ast = create_indexed_kernel(NodeCollection([assignment]), config=kernel_config) >>> kernel = kernel_ast.compile() >>> d_arr = np.zeros([5, 5]) >>> kernel(s=np.ones([5, 5]), d=d_arr, idx=index_arr) >>> d_arr array([[0. , 0. , 0. , 0. , 0. ], [0. , 4.1, 0. , 0. , 0. ], [0. , 0. , 4.2, 0. , 0. ], [0. , 0. , 0. , 4.3, 0. ], [0. , 0. , 0. , 0. , 0. ]])
 create_staggered_kernel(assignments, target=Target.CPU, gpu_exclusive_conditions=False, **kwargs)¶
Kernel that updates a staggered field.
For a staggered field, the first index coordinate defines the location of the staggered value. Further index coordinates can be used to store vectors/tensors at each point.
 Parameters:
assignments – a sequence of assignments or an AssignmentCollection. Assignments to staggered field are processed specially, while subexpressions and assignments to regular fields are passed through to create_kernel. Multiple different staggered fields can be used, but they all need to use the same stencil (i.e. the same number of staggered points) and shape.
target (
Target
) – ‘CPU’ or ‘GPU’gpu_exclusive_conditions – disable the use of multiple conditionals inside the loop. The outer layers are then handled in an else branch.
kwargs – passed directly to create_kernel, iteration_slice and ghost_layers parameters are not allowed
 Returns:
AST, see create_kernel
Code printing¶
 show_code(ast, custom_backend=None)¶
GPU Indexing¶
 class AbstractIndexing(iteration_space, data_layout)¶
Abstract base class for all Indexing classes. An Indexing class defines how an iteration space is mapped to GPU’s block and grid system. It calculates indices based on GPU’s thread and block indices and computes the number of blocks and threads a kernel is started with. The Indexing class is created with an iteration space that is given as list of slices to determine start, stop and the step size for each coordinate. Further the data_layout is given as tuple to determine the fast and slow coordinates. This is important to get an optimal mapping of coordinates to GPU threads.
 property iteration_space¶
Iteration space to loop over
 property data_layout¶
Data layout of the kernels arrays
 property dim¶
Number of spatial dimensions
 abstract property coordinates¶
Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic GPU block and thread indices. These symbolic indices can be obtained with the method index_variables
 property index_variables¶
Sympy symbols for GPU’s block and thread indices, and block and grid dimensions.
 abstract get_loop_ctr_assignments(loop_counter_symbols)¶
Adds assignments for the loop counter symbols depending on the gpu threads.
 Parameters:
loop_counter_symbols – typed symbols representing the loop counters
 Return type:
 Returns:
assignments for the loop counters
 abstract call_parameters(arr_shape)¶
Determine grid and block size for kernel call.
 Parameters:
arr_shape – the numeric (not symbolic) shape of the array
 Returns:
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with
 abstract guard(kernel_content, arr_shape)¶
In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
 Parameters:
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
 Returns:
ast node, which is put inside the kernel function
 abstract max_threads_per_block()¶
Return maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape return None for unknown
 abstract symbolic_parameters()¶
Set of symbols required in call_parameters code
 class BlockIndexing(iteration_space, data_layout, block_size=(128, 2, 1), permute_block_size_dependent_on_layout=True, compile_time_block_size=False, maximum_block_size=(1024, 1024, 64), device_number=None)¶
Generic indexing scheme that maps subblocks of an array to GPU blocks.
 Parameters:
iteration_space (
Tuple
[slice
]) – list of slices to determine start, stop and the step size for each coordinatedata_layout (
Tuple
[int
]) – tuple specifying loop order with innermost loop last. This is the same format as returned by Field.layout.permute_block_size_dependent_on_layout – if True the block_size is permuted such that the fastest coordinate gets the largest amount of threads
compile_time_block_size – compile in concrete block size, otherwise the gpu variable ‘blockDim’ is used
maximum_block_size – maximum block size that is possible for the GPU. Set to ‘auto’ to let cupy define the maximum block size from the device properties
device_number – device number of the used GPU. By default, the zeroth device is used.
 property coordinates¶
Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic GPU block and thread indices. These symbolic indices can be obtained with the method index_variables
 get_loop_ctr_assignments(loop_counter_symbols)¶
Adds assignments for the loop counter symbols depending on the gpu threads.
 Parameters:
loop_counter_symbols – typed symbols representing the loop counters
 Returns:
assignments for the loop counters
 call_parameters(arr_shape)¶
Determine grid and block size for kernel call.
 Parameters:
arr_shape – the numeric (not symbolic) shape of the array
 Returns:
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with
 guard(kernel_content, arr_shape)¶
In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
 Parameters:
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
 Returns:
ast node, which is put inside the kernel function
 limit_block_size_by_register_restriction(block_size, required_registers_per_thread)¶
Shrinks the block_size if there are too many registers used per block. This is not done automatically, since the required_registers_per_thread are not known before compilation. They can be obtained by
func.num_regs
from a cupy function. :type block_size: :param block_size: used block size that is target for limiting :type required_registers_per_thread: :param required_registers_per_thread: needed registers per threadreturns: smaller block_size if too many registers are used.
 static permute_block_size_according_to_layout(block_size, layout)¶
Returns modified block_size such that the fastest coordinate gets the biggest block dimension
 max_threads_per_block()¶
Return maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape return None for unknown
 symbolic_parameters()¶
Set of symbols required in call_parameters code
 class LineIndexing(iteration_space, data_layout)¶
Indexing scheme that assigns the innermost ‘line’ i.e. the elements which are adjacent in memory to a 1D GPU block. The fastest coordinate is indexed with thread_idx.x, the remaining coordinates are mapped to block_idx.{x,y,z} This indexing scheme supports up to 4 spatial dimensions, where the innermost dimensions is not larger than the maximum amount of threads allowed in a GPU block (which depends on device).
 Parameters:
 property coordinates¶
Returns a sequence of coordinate expressions for (x,y,z) depending on symbolic GPU block and thread indices. These symbolic indices can be obtained with the method index_variables
 get_loop_ctr_assignments(loop_counter_symbols)¶
Adds assignments for the loop counter symbols depending on the gpu threads.
 Parameters:
loop_counter_symbols – typed symbols representing the loop counters
 Returns:
assignments for the loop counters
 call_parameters(arr_shape)¶
Determine grid and block size for kernel call.
 Parameters:
arr_shape – the numeric (not symbolic) shape of the array
 Returns:
dict with keys ‘blocks’ and ‘threads’ with tuple values for number of (x,y,z) threads and blocks the kernel should be started with
 guard(kernel_content, arr_shape)¶
In some indexing schemes not all threads of a block execute the kernel content.
This function can return a Conditional ast node, defining this execution guard.
 Parameters:
kernel_content – the actual kernel contents which can e.g. be put into the Conditional node as true block
arr_shape – the numeric or symbolic shape of the field
 Returns:
ast node, which is put inside the kernel function
 max_threads_per_block()¶
Return maximal number of threads per block for launch bounds. If this cannot be determined without knowing the array shape return None for unknown
 symbolic_parameters()¶
Set of symbols required in call_parameters code