Performance driven data cache prefetching in a dynamic software optimization system
Source
International Conference on Supercomputing
Proceedings of the 21st Annual International Conference on Supercomputing
Seattle, Washington
Session: Architecture -- memory hierarchy
Pages: 202-209
Year of Publication: 2007
ISBN: 978-1-59593-768-1
Authors
Jean Christophe Beyler  Université Louis Pasteur, Illkirch - France
Philippe Clauss  Université Louis Pasteur, Illkirch - France
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1274971.1275000

ABSTRACT

Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However, the effectiveness of the issued prefetches has to be monitored in order to maximize their positive impact while minimizing their negative impact on performance. In previously proposed dynamic frameworks, monitoring relies either on processor performance counters or on specific hardware. In this work, we propose a prefetching strategy which does not use any specific hardware component or processor performance counter. Our dynamic framework is intended to be portable to any modern processor architecture that provides at least a prefetch instruction. The opportunity and effectiveness of prefetching a load are guided simply by the time spent to actually obtain the data. Every load of a program is monitored periodically and may or may not be associated with a dynamically inserted prefetch instruction. A load can be associated with a prefetch instruction during several disjoint periods of the program run, whenever prefetching is efficient. Our framework has been implemented for Itanium-2 machines. It involves several dynamic instrumentations of the binary code whose overhead is limited to only 4% on average. On a large set of benchmarks, our system is able to speed up some programs by 2% to 143%.
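
The time-guided decision described in the abstract can be pictured with a small simulation. This is only an illustrative sketch, not the authors' implementation: the real system instruments Itanium-2 binary code, whereas the code below models the per-load decision in Python. The names `LoadState` and `end_of_period`, the two thresholds, and the sample latencies are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# All names and thresholds are hypothetical, chosen for illustration only.
MISS_LATENCY_CYCLES = 50  # assumed: a slower timed load is treated as a likely cache miss
MIN_GAIN = 0.10           # assumed: a prefetch must cut mean latency by at least 10% to be kept


@dataclass
class LoadState:
    """Per-load bookkeeping for one monitored load instruction."""
    prefetch_on: bool = False
    baseline: float = 0.0  # mean latency seen just before the prefetch was inserted


def mean(samples):
    """Arithmetic mean of a non-empty list of latency samples."""
    return sum(samples) / len(samples)


def end_of_period(state, latencies):
    """Update one load's prefetch decision from the latencies timed in a monitoring period."""
    observed = mean(latencies)
    if not state.prefetch_on:
        # A slow load is a candidate: record its cost without a prefetch, then insert one.
        if observed > MISS_LATENCY_CYCLES:
            state.baseline = observed
            state.prefetch_on = True
    elif observed > state.baseline * (1.0 - MIN_GAIN):
        # The prefetch is not paying off for this load: remove it.
        state.prefetch_on = False
    return state.prefetch_on


if __name__ == "__main__":
    state = LoadState()
    # Made-up latency samples (in cycles) for six successive monitoring periods.
    periods = [[12, 14, 11], [210, 190, 205], [40, 35, 38],
               [200, 195, 210], [15, 12, 13], [220, 230, 210]]
    print([end_of_period(state, p) for p in periods])
    # prints [False, True, True, False, False, True]
```

In this made-up trace the load carries a prefetch during two disjoint stretches of the run, which mirrors the behaviour the abstract describes. Only timing measurements drive the decision, so no performance counter or dedicated hardware is modelled.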



