Slowest to Fastest:
2 with sequential addressing.
Further removing instruction bottlenecks:
Numba code (possible off-by-1s):
import numba
from numba import cuda

TPB = 128  # threads per block (assumed a power of two, at least 64)

@cuda.jit
def reduce(out, a, size):
    cache = cuda.shared.array(TPB, numba.float32)
    # Each block reduces 2 * TPB elements; launch with TPB threads per block
    i = cuda.blockIdx.x * (TPB * 2) + cuda.threadIdx.x
    local_i = cuda.threadIdx.x

    # First reduction on global read (guarded against reading past the end)
    cache[local_i] = 0
    if i < size:
        cache[local_i] = a[i]
    if i + TPB < size:
        cache[local_i] += a[i + TPB]
    cuda.syncthreads()

    # Sequential addressing with thread-ids
    stride = TPB // 2
    while stride > 32:
        if local_i < stride:
            cache[local_i] = cache[local_i] + cache[local_i + stride]
        cuda.syncthreads()
        stride >>= 1

    # Only 1 warp left, so no syncthreads() between steps
    # (relies on warp-synchronous execution within the last warp)
    if local_i < 32:
        cache[local_i] = cache[local_i] + cache[local_i + 32]
        cache[local_i] = cache[local_i] + cache[local_i + 16]
        cache[local_i] = cache[local_i] + cache[local_i + 8]
        cache[local_i] = cache[local_i] + cache[local_i + 4]
        cache[local_i] = cache[local_i] + cache[local_i + 2]
        cache[local_i] = cache[local_i] + cache[local_i + 1]

    # Thread 0 writes this block's partial sum
    if local_i == 0:
        out[cuda.blockIdx.x] = cache[0]
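For reference, here is a minimal host-side sketch of how the kernel above could be launched. The array size, the TPB value, and finishing the reduction on the host are illustrative assumptions, not part of the original:

import numpy as np
from numba import cuda

size = 1 << 20
a = np.random.rand(size).astype(np.float32)

# Each block consumes 2 * TPB elements because of the first add on the global read
blocks = (size + 2 * TPB - 1) // (2 * TPB)
partials = cuda.device_array(blocks, dtype=np.float32)

reduce[blocks, TPB](partials, cuda.to_device(a), size)

# Finish on the host (or run reduce again over the per-block partial sums)
total = partials.copy_to_host().sum()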
Update: An even faster approach for the final warp reduction is to use warp shuffle intrinsics (__shfl_down_sync). These let threads within a warp exchange register values directly, without going through shared memory at all. Since there is no shared memory traffic and no __syncthreads() needed (the _sync shuffle intrinsics synchronize the participating lanes themselves), shuffle-based reductions have significantly lower latency. For a warp reduction, you'd do something like:
// Warp-level reduction using shuffles
for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xffffffff, val, offset);
}
This is the go-to approach for modern CUDA reduction kernels.
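Numba exposes the same intrinsics (cuda.shfl_down_sync and friends), so the unrolled last-warp section above could be replaced by a register-only reduction. A minimal sketch, assuming a full warp of 32 active lanes; the warp_reduce name is just for illustration:

from numba import cuda

@cuda.jit(device=True)
def warp_reduce(val):
    # Each step pulls the partial sum from the lane `offset` positions above,
    # halving the number of live partial sums without touching shared memory.
    offset = 16
    while offset > 0:
        val += cuda.shfl_down_sync(0xFFFFFFFF, val, offset)
        offset >>= 1
    return val  # lane 0 ends up with the warp's total

In the kernel above, the threads with local_i < 32 would compute val = cache[local_i] + cache[local_i + 32], call warp_reduce(val), and lane 0 would write the block's partial sum.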