Haoran Qiu, Weichao Mao, et al.
USENIX ATC 2023
Modern cloud applications are refactored into microservices, which are deployed as containers across multiple servers. An end-user request often triggers several remote procedure calls (RPCs) between these microservices. RPC latency anomalies caused by packet-processing delays (bottlenecks) in the host network stack are common. Bottlenecks at a few network components can compound across services, causing SLA violations for many requests.
Diagnosing RPC latency anomalies is challenging because many host-level components can contribute to the delay. Identifying the bottleneck component is a crucial first step, but it often takes significant manual effort and expertise because operators lack visibility into per-component processing time. In this paper, we present PerfMon, a lightweight system that monitors the performance of components in the host network stack and automatically identifies the bottleneck component. We develop PerfMon using eBPF and evaluate it on a Kubernetes-managed cluster of bare-metal servers. Our evaluation demonstrates that PerfMon introduces minimal monitoring overhead while accurately identifying the bottleneck components.
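To make the idea of per-component visibility concrete, the following is a minimal sketch of how a bottleneck could be picked out once per-component processing times are available. It is not PerfMon's actual algorithm, which the abstract does not describe. The component names, the sample values, the `find_bottleneck` helper, and the median-slowdown heuristic are all assumptions made for illustration.

```python
from statistics import median


def find_bottleneck(baseline, observed):
    """Return (component, slowdown) for the component whose median
    processing time grew the most relative to its baseline median.

    Illustrative heuristic only; not taken from the paper.
    """
    slowdowns = {}
    for component, samples in observed.items():
        base = median(baseline[component])
        slowdowns[component] = median(samples) / base
    worst = max(slowdowns, key=slowdowns.get)
    return worst, slowdowns[worst]


# Hypothetical per-component processing times in microseconds.
baseline = {
    "nic_driver": [4, 5, 5, 6],
    "ip_layer": [2, 2, 3, 3],
    "tcp_layer": [8, 9, 9, 10],
    "socket_queue": [3, 3, 4, 4],
}
observed = {
    "nic_driver": [5, 5, 6, 6],
    "ip_layer": [2, 3, 3, 3],
    "tcp_layer": [9, 9, 10, 10],
    "socket_queue": [30, 35, 40, 45],
}

component, slowdown = find_bottleneck(baseline, observed)
print(component, round(slowdown, 1))
```

In this made-up data the socket queue stands out because its median time grows roughly tenfold while the other components stay near their baselines. A real deployment would collect these timings in the kernel (for example with eBPF probes) rather than from hard-coded lists.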