Characterizing the Performance Bottlenecks of Irregular GPU Kernels
MetadataShow full metadata
Graphics processing units (GPUs) are increasingly being used to accelerate general-purpose applications, including applications with data-dependent, irregular memory access patterns and control flow. However, relatively little is known about the behavior of irregular GPU codes, and there has been minimal effort to quantify the ways in which they differ from regular general-purpose GPU applications. I examine the behavior of a suite of optimized irregular GPU applications written in CUDA on a cycle-level GPU simulator. I characterize the performance bottlenecks in each program and connect source code to microarchitectural performance characteristics. I also assess the performance impact of modifying hardware parameters such as the cache and DRAM bandwidths and latencies, data cache sizes, coalescing behavior, and warp scheduling policy, and I discuss the implications for future GPU architecture design. I find that, while irregular graph codes exhibit significantly more underutilized execution cycles due to branch divergence, load imbalance, and synchronization overhead than regular programs, overall, code optimizations are often able to effectively address these performance hurdles. Insufficient bandwidth, long memory latency, and poor cache effectiveness are the biggest limiters of performance. Applications with irregular memory access patterns are more sensitive to changes in L2 latency and bandwidth than DRAM latency and bandwidth. Additionally, greedy-then-oldest scheduling is the best simple warp scheduler for irregular codes, and two-level scheduling does not significantly improve the performance of such codes.