Venkatesan T. Chakaravarthy, Fabio Checconi, et al.
IEEE TPDS
All-to-all communication is a well-known performance bottleneck for many applications, such as those that use the fast Fourier transform (FFT) algorithm. We analyze the performance of all-to-all communication on the Blue Gene/L torus interconnect, which experiences link contention even for all-to-all operations with short messages. We observed that all-to-all performance depends on the shape of the processor partition, and we present a performance analysis of all-to-all on partitions of various shapes. We then present optimization schemes that substantially improve the performance of all-to-all with both short and large messages. In particular, throughput improved from 64% to over 99% of peak on the 65,536-node (64 × 32 × 32) Blue Gene/L machine at the Lawrence Livermore National Laboratory. We show the impact of the all-to-all performance optimizations on 1-D and 3-D FFT benchmarks. With our optimized all-to-all, we achieved a performance of over 2.8 TF for the HPC Challenge 1-D FFT benchmark. © 2008 IEEE.
Preeti Malakar, Thomas George, et al.
SC 2012
Sameer Kumar, Chao Huang, et al.
IBM J. Res. Dev.
Anamitra Roy Choudhury, Alan King, et al.
IPDPS 2008