An analytic framework for detailed resource profiling in large and parallel programs and its application for memory use
Abstract
Profiling is an essential and widely used technique for understanding the resource use of applications. The memory use of large applications, for example, is becoming an important cost factor. Very large systems are typically sized to accommodate designated tasks, so the system price, as well as cache and TLB efficiency, depends significantly on the memory footprint of the target applications. The increasing use of multicore systems magnifies the problem, since memory use grows with the number of parallel tasks, and the presence of multiple tasks or threads makes it harder to correlate resource use with the program structure. Tools that correlate resource use with program structure and provide quantitative error margins are therefore essential for optimizing the resource use of complex software applications. While efficient tools for profiling execution time are available, the choices for detailed profiling of memory use or other hardware resources are very limited. We were unable to find tools that provide sufficiently accurate insight into, e.g., memory use without adding unacceptable memory and execution-time overhead for the performance analysis of very large applications. In this paper, we present a highly efficient probabilistic profiling method that provides detailed resource-usage information $R_{\Psi}(t)$ indexed by the full location descriptor $\Psi$ (e.g., process id, thread id, and call chain) and time $t$. Importantly, we provide an analytical framework that yields error estimates and allows a wide variety of profiling scenarios to be analyzed and quantitatively optimized. We employed the probabilistic approach to implement a memory profiling tool that adds minimal overhead and requires neither recompilation nor relinking. The tool provides the memory use $M_{\psi}(t)$ for all location descriptors $\psi$ over the execution time for single- and multithreaded programs. Experimental results confirm that the execution-time and memory overheads are less than 10 percent of the unprofiled, optimized execution. The technique is general enough to be applicable to profiling other hardware resources, such as cache or TLB misses, over time for all location descriptors with similarly low overhead and across multiple processes, threads, and processors. © 2010 IEEE.
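To make the idea of a location-indexed, probabilistically sampled memory profile concrete, the following is a minimal Python sketch of the bookkeeping only, not the authors' implementation. It assumes a hypothetical per-event hook, record_event, invoked on every allocation and free; it samples events with a fixed probability and scales sampled sizes by the inverse probability, so that the recorded $M_{\psi}(t)$ is unbiased in expectation. The sampling probability, function names, and the simplification of sampling allocations and frees independently are illustrative assumptions; the paper's analytic framework additionally derives error estimates, which this sketch omits.

```python
import random
import time
from collections import defaultdict

# Illustrative fixed sampling probability; the paper's framework would
# instead choose this to meet a target error bound.
SAMPLE_PROBABILITY = 0.01

# M[psi] maps a location descriptor psi = (pid, tid, call_chain) to a list
# of (timestamp, estimated_bytes_in_use) samples, i.e. M_psi(t).
M = defaultdict(list)
current_estimate = defaultdict(float)

def record_event(pid, tid, call_chain, size, allocating=True):
    """Hypothetical hook called on every allocation/free event.

    Samples the event with probability SAMPLE_PROBABILITY and updates the
    estimated memory use for its location descriptor.  A real tool would
    also ensure that the free matching a sampled allocation is accounted.
    """
    if random.random() >= SAMPLE_PROBABILITY:
        return  # event not sampled; this is what keeps the overhead low
    psi = (pid, tid, tuple(call_chain))
    # Scale by 1/p so the expected estimate equals the true volume.
    scaled = size / SAMPLE_PROBABILITY
    current_estimate[psi] += scaled if allocating else -scaled
    M[psi].append((time.monotonic(), current_estimate[psi]))

# Example use: two allocations and one free at the same (made-up) location.
chain = ("main", "parse_input", "alloc_buffer")
record_event(pid=1234, tid=1, call_chain=chain, size=4096)
record_event(pid=1234, tid=1, call_chain=chain, size=8192)
record_event(pid=1234, tid=1, call_chain=chain, size=4096, allocating=False)
```

With a real interposition mechanism (e.g., a preloaded allocator wrapper) supplying the events, aggregating the sampled, scaled sizes per location descriptor over time yields the kind of $M_{\psi}(t)$ profile described in the abstract while touching only a small fraction of allocation events.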