Conference paper

PerfMon: Performance Monitoring of Host Network Stack

Abstract

Modern cloud applications are refactored into microservices, which are deployed as containers across multiple servers. An end-user request often triggers several remote procedure calls (RPCs) between these microservices. RPC latency anomalies caused by packet-processing delays (bottlenecks) in the host network stack are common. Bottlenecks at a few network components can compound across services, causing SLA violations for many requests.

Diagnosing RPC latency anomalies is challenging because many host-level components can contribute to the delay. Identifying the bottleneck component is a crucial first step, but it often takes significant manual effort and expertise because per-component processing time is not visible. In this paper, we present PerfMon, a lightweight system that monitors the performance of components in the host network stack and automatically identifies the bottleneck component. We develop PerfMon using eBPF and evaluate it on a Kubernetes-managed cluster of bare-metal servers. Our evaluation demonstrates that PerfMon introduces minimal monitoring overhead while accurately identifying the bottleneck components.