Higher-Order and Tuple-Based Massively-Parallel Prefix Sums
Abstract
Prefix sums are an important parallel primitive, especially in massively-parallel programs. This thesis discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that supports higher orders and tuple values. Our templated CUDA implementation unifies all of these compu- tations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On very large inputs, it is even faster than CUB on the Titan X. SAM outperforms CUB by up to a factor of 2.9 on higher-order prefix sums and by up to a factor of 2.6 on tuple-based prefix sums.