Abstract
Although caches have been the backbone of the memory system for decades, the speed gap between the CPU and main memory suggests augmenting them with prefetching mechanisms. Recently, sophisticated hardware correlating prefetching mechanisms have been proposed, in some cases coupled with some form of dead-block prediction. In many proposals, however, correlating prefetchers demand a significant investment in hardware. In this paper we show that correlating prefetchers that work with tags instead of cache-line addresses are significantly more resource-efficient, while providing performance equal to or better than that of previous proposals. We support this claim by showing that per-set tag sequences exhibit highly repetitive patterns both within a set and across different sets. Because a single tag sequence can capture multiple address sequences spread over different cache sets, significant space savings can be achieved. We propose a tag-based prefetcher called a tag correlating prefetcher (TCP). Even with very small history tables, TCP outperforms address-based correlating prefetchers that are many times larger. In addition, we show that such a prefetcher can yield most of its performance benefits if placed at the L2 level of an aggressive out-of-order processor. Dead-block prediction is required only if one wants prefetching all the way up to L1. Finally, we draw parallels between the two-level structure of TCP and similar structures for branch prediction mechanisms; these parallels raise interesting opportunities for improving correlating memory prefetchers by harnessing lessons already learned for correlating branch predictors.
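The claim that a single tag sequence can stand in for several address sequences can be illustrated with a small sketch. The code below is not the TCP design from this paper; it is a minimal model under stated assumptions: 64-byte lines, 1024 sets, a history of two miss tags per set, and an unbounded pattern table shared by all sets. The names `split_address`, `join_address`, and `TagCorrelator` are hypothetical and introduced only for this illustration.

```python
from collections import defaultdict

LINE_BITS = 6    # assumed 64-byte cache lines
SET_BITS = 10    # assumed 1024 sets


def split_address(addr):
    """Return (tag, set_index) for a byte address under the assumed geometry."""
    line = addr >> LINE_BITS
    return line >> SET_BITS, line & ((1 << SET_BITS) - 1)


def join_address(tag, set_index):
    """Rebuild the line-aligned byte address from a tag and a set index."""
    return ((tag << SET_BITS) | set_index) << LINE_BITS


class TagCorrelator:
    """Illustrative two-level table: per-set tag history, shared tag patterns."""

    def __init__(self, history_len=2):
        self.history_len = history_len
        # First level: the most recent miss tags seen in each set.
        self.history = defaultdict(tuple)
        # Second level: tag history -> next tag, shared across all sets.
        self.pattern = {}

    def observe_miss(self, addr):
        """Record a miss and return a predicted prefetch address, or None."""
        tag, set_index = split_address(addr)
        hist = self.history[set_index]
        if len(hist) == self.history_len:
            # Learn that this tag followed the previous history.
            self.pattern[hist] = tag
        hist = (hist + (tag,))[-self.history_len:]
        self.history[set_index] = hist
        if len(hist) < self.history_len:
            return None
        next_tag = self.pattern.get(hist)
        if next_tag is None:
            return None
        # The predicted tag is combined with the current set index.
        return join_address(next_tag, set_index)
```

In this sketch, training on the tag sequence 5, 9, 7 in set 3 stores one pattern entry. When the tags 5, 9 later miss in set 8, the same entry predicts tag 7 in set 8, an address the model has never seen. An address-indexed table would need a separate entry for each set.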