Chen Chen, Yoav Tock, et al.
DEBS 2018
High performance computing systems display increasing complexity and component counts. This trend exposes weaknesses in the underlying clustering infrastructure needed for continuous availability, maximal utilization, and efficient administration of such systems. To mitigate the problem, we present a highly scalable clustering infrastructure, based on peer-to-peer technologies, for supporting resiliency-aware applications as well as efficient monitoring and load balancing. Supported services include Membership, Publish-subscribe messaging, Convergecast, Attribute replication, and a DHT. We present a preliminary evaluation performed on an IBM BlueGene/P, demonstrating scalability up to ~256K nodes.
A. Manzalini, R. Minerva, et al.
ICIN 2013
Artem Barger, Liran Funaro, et al.
ICBC 2023
Jagabondhu Hazra, Kaushik Das, et al.
SC 2011