|
ABSTRACT
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors, since they fail to issue prefetches early enough (particularly for nonsequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of nonsequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction-prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach hides 50% or more tof the latecy remaining with the best previous techniques, while at the same time reduces the number of useless prefetches by a factor of six. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, and that the compiler can limit the code expansion to only 9% on average. In addition, we show that the performance of our technique can be further increased by using profiling information to help reduce cache conflicts and unnecessary prefetches. From an architectural perspective, these performance advantages are sustained over a range of common miss latencies and bandwidth. Finally, our technique is cost effective as well, since it delivers performance comparable to (or even better than) that of larger caches, but requires a much smaller hardware budget.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Alfred V. Aho , Ravi Sethi , Jeffrey D. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1986
|
 |
2
|
|
| |
3
|
|
 |
4
|
Po-Yung Chang , Eric Hao , Yale N. Patt, Target prediction for indirect jumps, Proceedings of the 24th annual international symposium on Computer architecture, p.274-283, June 01-04, 1997, Denver, Colorado, United States
|
 |
5
|
|
| |
6
|
|
| |
7
|
Nikolas Gloy , Trevor Blackwell , Michael D. Smith , Brad Calder, Procedure placement using temporal ordering information, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.303-313, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
 |
8
|
Dirk Grunwald , Artur Klauser , Srilatha Manne , Andrew Pleszkun, Confidence estimation for speculation control, Proceedings of the 25th annual international symposium on Computer architecture, p.122-131, June 27-July 02, 1998, Barcelona, Spain
|
 |
9
|
Amir H. Hashemi , David R. Kaeli , Brad Calder, Efficient procedure mapping using cache line coloring, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.171-182, June 16-18, 1997, Las Vegas, Nevada, United States
|
 |
10
|
|
 |
11
|
|
 |
12
|
|
| |
13
|
|
 |
14
|
Ann Marie Grizzaffi Maynard , Colette M. Donnelly , Bret R. Olszewski, Contrasting characteristics and cache performance of technical and multi-user commercial workloads, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.145-156, October 05-07, 1994, San Jose, California, United States
|
 |
15
|
Todd C. Mowry , Monica S. Lam , Anoop Gupta, Design and evaluation of a compiler algorithm for prefetching, Proceedings of the fifth international conference on Architectural support for programming languages and operating systems, p.62-73, October 12-15, 1992, Boston, Massachusetts, United States
|
| |
16
|
|
 |
17
|
|
| |
18
|
|
| |
19
|
|
 |
20
|
Glenn Reinman , Todd Austin , Brad Calder, A scalable front-end architecture for fast instruction delivery, Proceedings of the 26th annual international symposium on Computer architecture, p.234-245, May 01-04, 1999, Atlanta, Georgia, United States
|
 |
21
|
Vatsa Santhanam , Edward H. Gornish , Wei-Chung Hsu, Data prefetching on the HP PA-8000, Proceedings of the 24th annual international symposium on Computer architecture, p.264-273, June 01-04, 1997, Denver, Colorado, United States
|
| |
22
|
SMITH, A. 1978. Sequential program prefetching in memory hierarchies. IEEE Computer 11, 2, 7-21.
|
 |
23
|
|
| |
24
|
|
| |
25
|
Jared Stark , Paul Racunas , Yale N. Patt, Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.34-43, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
| |
26
|
WEBB, C. F. 1988. Subroutine call/Return stack. IBM Tech. Discl. Bull. 30, 11 (Apr.).
|
 |
27
|
|
| |
28
|
|
| |
29
|
YU, A. AND CHEN, J. 1996. The Postgres95 User Manual v1.0. University of California at Berkeley, Berkeley, CA.
|
CITED BY 4
|
|
|
|
|
|
|
|
Seung Woo Son , Mahmut Kandemir , Mustafa Karakoy , Dhruva Chakrabarti, A compiler-directed data prefetching scheme for chip multiprocessors, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
|
REVIEW
"Olivier Louis Marie Lecarme : Reviewer"
Although this somewhat long paper was published in ACM Transactions on Computer Systems, it could have been published in ACM Transactions on Programming Languages and Systems as well. In fact, a joint publication would have been bett
more...
Peer to Peer - Readers of this Article have also read:
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
Putting innovation to work: adoption strategies for multimedia communication systems
Communications of the ACM
34, 12
Ellen Francik
, Susan Ehrlich Rudman
, Donna Cooper
, Stephen Levine
-
An intelligent component database for behavioral synthesis
Proceedings of the 27th ACM/IEEE Design Automation Conference on
Gwo-Dong Chen
, Daniel D. Gajski
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
|