Whole-program optimization for time and space efficient threads
Abstract
Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and to simplify program structure. Multitasking operating systems use threads to mask communication latency, either with hardware devices or with users. Client-server applications typically use threads to simplify the complex control flow that arises when servicing multiple clients. Recently, the scientific computing community has started using threads to mask network communication latency in massively parallel architectures, allowing computation and communication to be overlapped. Lastly, some architectures implement threads in hardware, using those threads to tolerate memory latency. In general, it would be desirable if threaded programs could be written to expose the largest degree of parallelism possible, or simply to simplify the program design. However, threads incur time and space overheads, and programmers often compromise simple designs for the sake of performance. In this paper, we show how to reduce the time and space overhead of threads using control-flow and register-liveness information inferred after compilation. Our techniques work on binaries, are not specific to a particular compiler or thread library, and reduce the overall execution time of fine-grain threaded programs by ≈15-30%. We use execution-driven analysis and an instrumented operating system to show why the execution time is reduced and to indicate areas for future work.
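To make the core idea concrete, the sketch below (not taken from the paper; the register count, names, and bitmask encoding are all hypothetical) contrasts a naive thread switch, which saves and restores the entire register file, with a liveness-directed switch that touches only the registers a post-compilation analysis has proven live at the switch point:

#include <stdint.h>
#include <string.h>

#define NUM_REGS 32

typedef struct { uint64_t regs[NUM_REGS]; } Context;

/* Naive switch: spill and reload all NUM_REGS registers,
 * regardless of whether they hold useful values. */
void switch_all(Context *from, Context *to,
                const uint64_t cpu[NUM_REGS], uint64_t out[NUM_REGS]) {
    memcpy(from->regs, cpu, sizeof from->regs);  /* save everything    */
    memcpy(out, to->regs, sizeof to->regs);      /* restore everything */
}

/* Liveness-directed switch: `live_out` marks registers live after the
 * suspending thread's switch point, `live_in` those live at the resuming
 * thread's entry.  Both masks would be computed offline from the binary's
 * control-flow graph; only marked registers are moved. */
void switch_live(Context *from, Context *to,
                 uint32_t live_out, uint32_t live_in,
                 const uint64_t cpu[NUM_REGS], uint64_t out[NUM_REGS]) {
    for (int r = 0; r < NUM_REGS; r++) {
        if (live_out & (1u << r)) from->regs[r] = cpu[r];
        if (live_in  & (1u << r)) out[r] = to->regs[r];
    }
}

When only a handful of registers are live at a switch point, the second routine moves a few words instead of the whole register file, which suggests where the time savings in fine-grain threaded code come from; the per-thread Context storage can shrink analogously.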