ABSTRACT
Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However, the effectiveness of the issued prefetches has to be monitored in order to maximize their positive impact while minimizing their negative impact on performance. In previously proposed dynamic frameworks, monitoring relies either on processor performance counters or on specific hardware. In this work, we propose a prefetching strategy that uses neither specific hardware components nor processor performance counters. Our dynamic framework is intended to be portable to any modern processor architecture that provides at least a prefetch instruction. The opportunity and effectiveness of prefetching a load are guided simply by the time spent to actually obtain the data. Every load of a program is monitored periodically and can either be associated with a dynamically inserted prefetch instruction or not. A load can be associated with a prefetch instruction during disjoint periods of the program run, whenever doing so is efficient. Our framework has been implemented for Itanium-2 machines. It involves several dynamic instrumentations of the binary code, whose overhead is limited to only 4% on average. On a large set of benchmarks, our system is able to speed up some programs by 2% to 143%.
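The time-guided decision described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the framework's actual implementation, which instruments Itanium-2 binaries at run time. The names `LoadState` and `update_load`, the `miss_threshold` and `min_gain` parameters, and their default values are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class LoadState:
    """Per-load bookkeeping (hypothetical, for illustration only)."""
    prefetch_enabled: bool = False
    baseline_latency: float = 0.0  # last latency measured without a prefetch


def update_load(state, measured_latency, miss_threshold=20.0, min_gain=0.10):
    """Process one monitoring period for one load.

    measured_latency is the time spent obtaining the data in this period.
    miss_threshold and min_gain are assumed tuning knobs, not paper values.
    Returns True if the load should carry a prefetch in the next period.
    """
    if not state.prefetch_enabled:
        # No prefetch was active: remember the latency and enable a
        # prefetch only if the load looks slow enough to be worth it.
        state.baseline_latency = measured_latency
        state.prefetch_enabled = measured_latency > miss_threshold
    elif state.baseline_latency <= 0.0:
        # Defensive guard: without a usable baseline, drop the prefetch.
        state.prefetch_enabled = False
    else:
        # A prefetch was active: keep it only if it shortened the load
        # by at least min_gain relative to the unprefetched baseline.
        gain = (state.baseline_latency - measured_latency) / state.baseline_latency
        state.prefetch_enabled = gain >= min_gain
    return state.prefetch_enabled
```

Calling `update_load` once per monitoring period lets a load gain and lose its prefetch over disjoint periods of the run, which is the behavior the abstract describes. A real system would obtain `measured_latency` from timing instrumentation around the load.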