Conference paper

The Cloud, Like Building and Running Go Binaries

Abstract

We document the UX challenges of targeting distributed batch-processing code to Cloud resources. The challenges largely stem from the need for every team member to know everything about the mechanics of achieving scale. As a result, both running and writing code become overwhelming tasks. We present Lunchpail, a tool designed to address these challenges. We present two case studies, one of a team running AI/ML workloads and one of the code they wrote to make it happen. We quantify the challenges with two novel UX metrics: multiplicity and divergence. We show that the code base manifests a multitude of concerns, including distribution, packaging, and automation; 64–98% of the team’s code diverges from the main goal of the application. The story is paralleled when running workloads. Users switch between 3–7 types of tasks on a daily basis (high multiplicity). The nature of these tasks differs greatly from the users’ core competencies (high divergence). In particular, we show that all users assume the daily burdens of cluster operators. We demonstrate that four angles of attack, combined, can yield significant reductions in complexity: 1) Adopt a Serverless approach, allowing code to focus on its core logic, which can be as little as 2% of the code base. 2) Treat application packaging like building a Go binary via go build. This binary embeds source, configuration, deployment logic, and a lightweight runtime that channels data to workers with fan-out and queuing. 3) Treat running distributed application pipelines against Cloud resources like launching said binaries, with simple bash “|” syntax. 4) When possible, avoid multi-tenancy, and instead target Cloud virtual machines directly. We present a large experimental study to quantify the viability of obtaining a dedicated “burst” of cloud resources for every job run. We show that VMs can be ready in well under a minute, which is 10–20x faster than scaling a Kubernetes cluster. We embody this approach in Lunchpail.
Lunchpail itself is small, weighing in at 12k lines of code (10% of the size of Kubeflow, 2.5% of Ray, 1% of Kueue). We validate Lunchpail against AI/ML code and legacy chip design workloads, and show that it adds little overhead on top of acquiring Cloud VMs.
