Compiler orchestrated prefetching via speculation and predication
Full text: PDF (248 KB)
Source: Architectural Support for Programming Languages and Operating Systems
Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems
Boston, MA, USA
SESSION: Memory system analysis and optimization
Pages: 189-198
Year of Publication: 2004
ISBN:1-58113-804-0
Authors
Rodric M. Rabbah  Massachusetts Institute of Technology
Hariharan Sandanagobalane  National University of Singapore
Mongkol Ekpanyapong  Georgia Institute of Technology
Weng-Fai Wong  National University of Singapore
Sponsors
SIGPLAN: ACM Special Interest Group on Programming Languages
SIGOPS: ACM Special Interest Group on Operating Systems
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 62,   Citation Count: 5
DOI: http://doi.acm.org/10.1145/1024393.1024416

ABSTRACT

This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize the memory access pattern of both regular array-based applications and irregular pointer-intensive programs. We illustrate how program embedded precomputation via speculative execution can accurately predict and effectively prefetch future memory references with negligible overhead. The proposed techniques reduce the total running time of seven SPEC benchmarks and two OLDEN benchmarks by 27% on an Itanium 2 processor. The improvements are in addition to several state-of-the-art optimizations including software pipelining and data prefetching. In addition, we use cycle-accurate simulations to identify important and lightweight architectural innovations that further mitigate the memory system bottleneck. In particular, we focus on the notoriously challenging class of pointer-chasing applications, and demonstrate how they may benefit from a novel scheme of sentineled prefetching. Our results for twelve SPEC benchmarks demonstrate that 45% of the processor stalls that are caused by the memory system are avoidable. The techniques in this paper can effectively mask long memory latencies with little instruction overhead, and can readily contribute to the performance of processors today.



