Reductions in Pystencils#

Reductions play a vital role in numerical simulations, as they aggregate data across multiple elements, for example computing the sum or product of an array's elements, or finding its minimum or maximum.

Specifying Assignments with Reductions#

In pystencils, reductions are made available via specialized assignments, namely ReductionAssignment. Here is a snippet creating a reduction assignment for adding up all elements of a field:

r = ps.TypedSymbol("r", "double")
x = ps.fields(f"x: double[3D]", layout="fzyx")

assign_sum = ps.AddReductionAssignment(r, x.center())

For each point in the iteration space, the left-hand side symbol r accumulates the contents of the right-hand side x.center(). In our case, the AddReductionAssignment denotes an accumulation via additions.

Pystencils needs type information about the reduction symbol, so r must be a TypedSymbol.

The following reduction assignment classes are available in pystencils:

  • AddReductionAssignment: Builds sum over elements

  • SubReductionAssignment: Builds difference over elements

  • MulReductionAssignment: Builds product over elements

  • MinReductionAssignment: Finds minimum element

  • MaxReductionAssignment: Finds maximum element
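The semantics of these reduction variants can be sketched in plain NumPy. This is an illustrative emulation, not the pystencils API; in particular, the assumption that the subtraction variant repeatedly subtracts each element from the accumulator follows from the accumulation semantics described above:

```python
import numpy as np

x = np.array([4.0, 2.0, 8.0])

# Each reduction folds the field's elements into the accumulator,
# starting from the operation's neutral element (0 for +/-, 1 for *).
add_result = 0.0 + np.sum(x)   # AddReductionAssignment: 14.0
sub_result = 0.0 - np.sum(x)   # SubReductionAssignment: -14.0
mul_result = 1.0 * np.prod(x)  # MulReductionAssignment: 64.0
min_result = np.min(x)         # MinReductionAssignment: 2.0
max_result = np.max(x)         # MaxReductionAssignment: 8.0
```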

Note

Alternatively, you can use the reduction_assignment function to specify reduction assignments:

from pystencils.sympyextensions import reduction_assignment
from pystencils.sympyextensions.reduction import ReductionOp

assign_sum = reduction_assignment(r, ReductionOp.Add, x.center())

For other reduction operations, the following enum members can be passed to reduction_assignment.

class ReductionOp(Enum):
    Add = "+"
    Sub = "-"
    Mul = "*"
    Min = "min"
    Max = "max"

Generating and Running Reduction Kernels#

With the assignments being fully assembled, we can finally invoke the code generator and create the kernel object via the create_kernel function.

CPU Platforms#

For this example, we assume a kernel configuration for CPU platforms with no optimizations explicitly enabled.

cpu_cfg = ps.CreateKernelConfig(target=ps.Target.CurrentCPU)
kernel = ps.create_kernel(assign_sum, cpu_cfg)

ps.inspect(kernel)
void kernel (const double * RESTRICT const _data_x, const int64 _size_x_0, const int64 _size_x_1, const int64 _size_x_2, const int64 _stride_x_0, const int64 _stride_x_1, const int64 _stride_x_2, double * const r)
{
   r_local: float64 = [0.0: float64];
   for(ctr_2: int64 = [0: int64]; ctr_2 < _size_x_2; ctr_2 += [1: int64])
   {
      for(ctr_1: int64 = [0: int64]; ctr_1 < _size_x_1; ctr_1 += [1: int64])
      {
         for(ctr_0: int64 = [0: int64]; ctr_0 < _size_x_0; ctr_0 += [1: int64])
         {
            r_local = r_local + _data_x[ctr_0 * _stride_x_0 + ctr_1 * _stride_x_1 + ctr_2 * _stride_x_2];
         }
      }
   }
   r[[0: int64]] = r[[0: int64]] + r_local;
}

Note

The generated reduction kernels may vary vastly for different platforms and optimizations. You can find a detailed description of configuration choices and their impact on the generated code below.

The kernel can be compiled and run immediately.

To execute the kernel on CPUs, a numpy.ndarray must be passed not only for each field but also for exporting the reduction result. The export mechanism can be seen in the generated code above: the kernel receives a pointer named after the reduction symbol (here: r), through which the reduction result is written back. Note that the value passed via this pointer is not overwritten but incorporated into the reduction computation. Since our reduction result is a single scalar value, an array with a single element suffices.

kernel_func = kernel.compile()

x_array = np.ones((4, 4, 4), dtype="float64")
reduction_result = np.zeros((1,), dtype="float64")

kernel_func(x=x_array, r=reduction_result)

reduction_result[0]
64.0
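Because the passed value is incorporated rather than overwritten, seeding the result array with a non-zero value shifts the outcome accordingly. This write-back behavior can be emulated in plain NumPy (no kernel invocation here, just the arithmetic the kernel performs):

```python
import numpy as np

x_array = np.ones((4, 4, 4), dtype="float64")

# Emulate the kernel's write-back: r[0] = r[0] + sum over all field elements.
# Seeding with 2.0 instead of 0.0 shifts the result by that amount.
reduction_result = np.full((1,), 2.0, dtype="float64")
reduction_result[0] += x_array.sum()

print(reduction_result[0])  # 2.0 + 64.0 = 66.0
```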

GPU Platforms#

Please note that reductions are currently only supported for CUDA platforms. Similar to the CPU section, a base variant for NVIDIA GPUs without explicitly employing any optimizations is shown:

gpu_cfg = ps.CreateKernelConfig(target=ps.Target.CUDA)

kernel_gpu = ps.create_kernel(assign_sum, gpu_cfg)

ps.inspect(kernel_gpu)
__global__ void kernel (const double * RESTRICT const _data_x, const int64 _size_x_0, const int64 _size_x_1, const int64 _size_x_2, const int64 _stride_x_0, const int64 _stride_x_1, const int64 _stride_x_2, double * const r)
{
   r_local: float64 = [0.0: float64];
   ctr_2: const int64 = (int64) blockIdx.z * (int64) blockDim.z + (int64) threadIdx.z;
   ctr_1: const int64 = (int64) blockIdx.y * (int64) blockDim.y + (int64) threadIdx.y;
   ctr_0: const int64 = (int64) blockIdx.x * (int64) blockDim.x + (int64) threadIdx.x;
   if(ctr_2 < _size_x_2 && ctr_1 < _size_x_1 && ctr_0 < _size_x_0)
   {
      r_local = r_local + _data_x[ctr_0 * _stride_x_0 + ctr_1 * _stride_x_1 + ctr_2 * _stride_x_2];
   }
   {
      if(ctr_2 < _size_x_2 - [0: int64] && ctr_1 < _size_x_1 - [0: int64] && ctr_0 < _size_x_0 - [0: int64])
      {
         atomicAdd(r, r_local);
      }
   }
}

The steps for running the generated code on NVIDIA GPUs are identical, but the fields and the write-back pointer now require device memory, i.e. instances of cupy.ndarray.

Optimizations for Reductions#

Going beyond the aforementioned basic kernel configurations, we now demonstrate optimization strategies for different platforms that can be applied to reduction kernels and show what impact they have.

CPU Platforms#

For CPU platforms, standard optimizations include SIMD vectorization and shared-memory parallelism via OpenMP. The supported SIMD instruction sets for reductions are:

  • SSE3

  • AVX/AVX2

  • AVX512

Below, AVX vectorization is enabled by selecting the target Target.X86_AVX. Note that reductions require assume_inner_stride_one to be enabled, since other inner strides would require masked SIMD operations, which are not supported yet.

# configure SIMD vectorization
cpu_cfg_opt = ps.CreateKernelConfig(
  target=ps.Target.X86_AVX,
)
cpu_cfg_opt.cpu.vectorize.enable = True
cpu_cfg_opt.cpu.vectorize.assume_inner_stride_one = True

# configure OpenMP parallelization
cpu_cfg_opt.cpu.openmp.enable = True
cpu_cfg_opt.cpu.openmp.num_threads = 8

kernel_cpu_opt = ps.create_kernel(assign_sum, cpu_cfg_opt)

ps.inspect(kernel_cpu_opt)
void kernel (const double * RESTRICT const _data_x, const int64 _size_x_0, const int64 _size_x_1, const int64 _size_x_2, const int64 _stride_x_1, const int64 _stride_x_2, double * const r)
{
   r_local: float64 = [0.0: float64];
   ctr_0__rem_start: const int64 = _size_x_0 / [4: int64] * [4: int64];
   {
      #pragma omp parallel num_threads(8) reduction(+: r_local)
      {
         r_local__1: __m256d = _mm256_set_pd([0.0: float64], [0.0: float64], [0.0: float64], [0.0: float64]);
         #pragma omp for schedule(static)
         for(ctr_2: int64 = [0: int64]; ctr_2 < _size_x_2; ctr_2 += [1: int64])
         {
            for(ctr_1: int64 = [0: int64]; ctr_1 < _size_x_1; ctr_1 += [1: int64])
            {
               for(ctr_0__1: int64 = [0: int64]; ctr_0__1 < ctr_0__rem_start; ctr_0__1 += [4: int64])
               {
                  ctr_0: const int64 = ctr_0__1;
                  r_local__1 = _mm256_add_pd(r_local__1, _mm256_loadu_pd(&_data_x[ctr_0 + ctr_1 * _stride_x_1 + ctr_2 * _stride_x_2]));
               }
               for(ctr_0__0: int64 = ctr_0__rem_start; ctr_0__0 < _size_x_0; ctr_0__0 += [1: int64])
               {
                  r_local = r_local + _data_x[ctr_0__0 + ctr_1 * _stride_x_1 + ctr_2 * _stride_x_2];
               }
            }
         }
         r_local = _mm256_horizontal_add_pd(r_local, r_local__1);
      }
   }
   r[[0: int64]] = r[[0: int64]] + r_local;
}
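The structure of this vectorized kernel, a four-lane accumulator in the main loop, a scalar remainder loop, and a final horizontal add, can be mimicked in NumPy. This is a sketch of the strategy for a 1D array, not the generated code:

```python
import numpy as np

def vectorized_sum(data: np.ndarray) -> float:
    """Sum a 1D array the way the AVX kernel does: 4 lanes plus remainder."""
    lanes = np.zeros(4)               # r_local__1: the 4-lane SIMD accumulator
    rem_start = data.size // 4 * 4    # ctr_0__rem_start: where the remainder begins
    for i in range(0, rem_start, 4):  # vectorized main loop, 4 elements per step
        lanes += data[i:i + 4]
    r_local = data[rem_start:].sum()  # scalar remainder loop
    return r_local + lanes.sum()      # horizontal add across the lanes

print(vectorized_sum(np.arange(10.0)))  # 45.0
```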

GPU Platforms#

As evident from the generated kernel for the base variant, atomic operations are employed for updating the pointer holding the reduction result. Using the explicit warp-level instructions provided by CUDA allows us to achieve higher performance than atomic operations alone. To generate kernels with warp-level reductions, the generator expects the CUDA block sizes to be divisible by the hardware's warp size. Similar to the SIMD configuration, we assure the code generator that the configured block size fulfills this criterion by enabling assume_warp_aligned_block_size. While the default block sizes provided by the code generator already fulfill this criterion, we additionally employ a block-fitting algorithm to obtain a block size that is also optimized for the kernel's iteration space.

You can find more detailed information about warp size alignment in Walkthrough: Code Generation for GPUs.

gpu_cfg_opt = ps.CreateKernelConfig(target=ps.Target.CUDA)
gpu_cfg_opt.gpu.assume_warp_aligned_block_size = True
gpu_cfg_opt.gpu.warp_size = 32

kernel_gpu_opt = ps.create_kernel(assign_sum, gpu_cfg_opt)

kernel_func = kernel_gpu_opt.compile()
kernel_func.launch_config.fit_block_size((32, 1, 1))

ps.inspect(kernel_gpu_opt)
__global__ void kernel (const double * RESTRICT const _data_x, const int64 _size_x_0, const int64 _size_x_1, const int64 _size_x_2, const int64 _stride_x_0, const int64 _stride_x_1, const int64 _stride_x_2, double * const r)
{
   r_local: float64 = [0.0: float64];
   ctr_2: const int64 = (int64) blockIdx.z * (int64) blockDim.z + (int64) threadIdx.z;
   ctr_1: const int64 = (int64) blockIdx.y * (int64) blockDim.y + (int64) threadIdx.y;
   ctr_0: const int64 = (int64) blockIdx.x * (int64) blockDim.x + (int64) threadIdx.x;
   if(ctr_2 < _size_x_2 && ctr_1 < _size_x_1 && ctr_0 < _size_x_0)
   {
      r_local = r_local + _data_x[ctr_0 * _stride_x_0 + ctr_1 * _stride_x_1 + ctr_2 * _stride_x_2];
   }
   {
      r_local = r_local + __shfl_xor_sync(0xffffffff, r_local, [16: int32]);
      r_local = r_local + __shfl_xor_sync(0xffffffff, r_local, [8: int32]);
      r_local = r_local + __shfl_xor_sync(0xffffffff, r_local, [4: int32]);
      r_local = r_local + __shfl_xor_sync(0xffffffff, r_local, [2: int32]);
      r_local = r_local + __shfl_xor_sync(0xffffffff, r_local, [1: int32]);
      if(ctr_2 < _size_x_2 - [0: int64] && ctr_1 < _size_x_1 - [0: int64] && ctr_0 < _size_x_0 - [0: int64] && (threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * (blockDim.x * blockDim.y)) % [32: int32] == [0: int32])
      {
         atomicAdd(r, r_local);
      }
   }
}
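The __shfl_xor_sync sequence in this kernel implements a butterfly reduction across the 32 lanes of a warp: after the five exchange steps, every lane holds the warp-wide sum, and only one lane per warp issues the atomic update. The exchange pattern can be mimicked on the host (an illustrative emulation, not device code):

```python
import numpy as np

def warp_butterfly_sum(lane_values: np.ndarray) -> np.ndarray:
    """Emulate the __shfl_xor_sync butterfly reduction over a 32-lane warp."""
    vals = lane_values.astype(np.float64).copy()
    lanes = np.arange(32)
    for offset in (16, 8, 4, 2, 1):         # same offsets as in the kernel
        vals = vals + vals[lanes ^ offset]  # each lane adds its XOR partner's value
    return vals                             # every lane now holds the full sum

result = warp_butterfly_sum(np.ones(32))
print(result[0])  # 32.0
```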