ACM Home Page
Please provide us with feedback. Feedback
A study of source-level compiler algorithms for automatic construction of pre-execution code
Full text PdfPdf (1.55 MB)
Source ACM Transactions on Computer Systems (TOCS) archive
Volume 22 ,  Issue 3  (August 2004) table of contents
Pages: 326 - 379  
Year of Publication: 2004
ISSN:0734-2071
Authors
Dongkeun Kim  University of Maryland at College Park, MD
Donald Yeung  University of Maryland at College Park, MD
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): n/a,   Downloads (12 Months): n/a,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1012268.1012270
What is a DOI?

ABSTRACT

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts).We prototype our algorithms on the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [Lyle and Wallace 1997]. Using our prototype, we undertake a performance evaluation of our compilers on a detailed architectural simulator of an 8-way out-of-order SMT processor with 4 hardware contexts, and 13 applications selected from the SPEC and Olden benchmark suites. Our most aggressive compiler improves the performance of 10 out of 13 applications, reducing execution time by 20.9%. Across all 13 applications, our aggressive compiler achieves a harmonic average speedup of 17.6%. For our reduced compilers, eliminating program slicing and relying on back-end optimizations degrades performance minimally, suggesting that effective pre-execution compilers can be built without program slicing. Furthermore, without cache-miss profiles, we still achieve good speedup, 15.5%, but without loop-trip count profiles, we achieve a speedup of only 7.7%. Finally, our results show compiler-based pre-execution can benefit multiprogrammed workloads. Simultaneously executing applications achieve higher throughput with pre-execution compared to no pre-execution. Due to contention for hardware contexts, however, time-slicing outperforms simultaneous execution in some cases where individual applications make heavy use of pre-execution threads.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
Anderson, J. M., Berc, L. M., Dean, J., Ghemawat, S., Henzinger, M. R., Leung, S.-T. A., Sites, R. L., Vandevoorde, M. T., Waldspurger, C. A., and Weihl, W. E. 1997. Continuous profiling: Where have all the cycles gone? SRC Technical Note 1997-016a, Digital. July.
3
 
4
Binkley, D. and Gallagher, K. 1996. A Survey of Program Slicing. Academic Press, Orlando, Fla.
 
5
Burger, D. and Austin, T. M. 1997. The SimpleScalar Tool Set, Version 2.0. CS TR 1342, University of Wisconsin-Madison, Madison, Wisc., June.
 
6
7
8
 
9
 
10
 
11
12
 
13
Cytron, R. 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing. (University Park, PA). IEEE Computer Society Press, Los Alamitos, Calif., 836--844.
 
14
Dubois, M. and Song, Y. H. 1998. Assisted execution. CENG Technical Report 98-25, Department of EE-Systems, University of Southern California. October.
15
 
16
17
18
19
20
 
21
Lyle, J. R. and Wallace, D. R. May 1997. Using the unravel program slicing tool to evaluate high integrity software. In Proceedings of 10th International Software Quality Week (San Francisco, Calif.).
 
22
Lyle, J. R., Wallace, D. R., Graham, J. R., Gallagher, K. B., Poole, J. P., and Binkley, D. W. 1995. Unravel: A CASE tool to assist evaluation of high integrity software. NISTIR 5691, National Institute of Standards and Technology. August.
 
23
24
25
 
26
Padua, D. A., Kuck, D. J., and Lawrie, D. H. 1980. High-speed multiprocessors and compilation techniques. IEEE Trans. Comput. C-29, 9 (Sept.), 763--776.
27
 
28
29
30
31
 
32
 
33
34
 
35
SPEC. 2000. SPEC CPU2000 V1.2 (http://www.specbench.org/osg/cpu2000/).
36
37
 
38
 
39
 
40
Weiser, M. 1984. Program slicing. IEEE Trans. Softw. Eng. SE-10, 4 (July).
41
42
 
43


Collaborative Colleagues:
Dongkeun Kim: colleagues
Donald Yeung: colleagues