Efficient implementation of scatter-gather operations for large-scale graph analytics
Abstract
We developed a methodology to improve the cache behavior and overall performance of sparse linear algebra kernels used in graph analytics. Large-scale graph processing typically performs poorly because it cannot use processor caches effectively, resulting in high-latency memory accesses. This is particularly true of sparse linear algebra formulations of graph algorithms, which rely on well-known kernels such as sparse matrix-vector multiply and sparse matrix transposition. The proposed methodology partitions the inner loops of these kernels and executes the iterations over the partitions in an independently ordered manner. This produces more cache-friendly access patterns, which reduce cache misses and improve performance. We first illustrate the technique through an example and then generalize it to multiple scenarios. Our evaluation shows that the execution time of key computational kernels is reduced 2- to 5-fold for large graphs. Most of the improvement comes from a significant reduction in cycles per instruction (CPI), even as the path length of the executed code increases.
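To make the idea of partitioning the inner loop concrete, the following is a minimal sketch, not the paper's implementation: a sparse matrix-vector multiply in which the nonzeros are bucketed by column partition, so that the gathers from the source vector within each partition stay inside a cache-sized window. The struct and function names, the coordinate storage format, and the parameter part_width are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// One nonzero of the sparse matrix in coordinate (COO) form.
struct NonZero { int row; int col; double val; };

// Compute y += A * x with the inner loop partitioned by column range.
// part_width bounds the span of x touched within any one partition, so
// that the gathered slice of x can remain cache-resident (e.g. choose
// part_width * sizeof(double) on the order of the L2 cache size).
void spmv_partitioned(const std::vector<NonZero>& nz,
                      const std::vector<double>& x,
                      std::vector<double>& y,
                      int num_cols,
                      int part_width) {
    int num_parts = (num_cols + part_width - 1) / part_width;

    // Bucket nonzeros by the column partition they gather from.
    std::vector<std::vector<NonZero>> parts(num_parts);
    for (const NonZero& e : nz)
        parts[e.col / part_width].push_back(e);

    // Execute the iterations of each partition independently; within a
    // partition every read of x falls in a window of at most part_width
    // elements, while results are scattered into y.
    for (const auto& part : parts)
        for (const NonZero& e : part)
            y[e.row] += e.val * x[e.col];
}
```

Processing one partition at a time reorders the iterations relative to a plain row-by-row traversal, which is the sense in which the partitions are "executed in an independently ordered manner": correctness is preserved because the accumulation into y is order-independent, while the gather footprint per partition shrinks to a cache-friendly range.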