|
ABSTRACT
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts).We prototype our algorithms on the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [Lyle and Wallace 1997]. Using our prototype, we undertake a performance evaluation of our compilers on a detailed architectural simulator of an 8-way out-of-order SMT processor with 4 hardware contexts, and 13 applications selected from the SPEC and Olden benchmark suites. Our most aggressive compiler improves the performance of 10 out of 13 applications, reducing execution time by 20.9%. Across all 13 applications, our aggressive compiler achieves a harmonic average speedup of 17.6%. For our reduced compilers, eliminating program slicing and relying on back-end optimizations degrades performance minimally, suggesting that effective pre-execution compilers can be built without program slicing. Furthermore, without cache-miss profiles, we still achieve good speedup, 15.5%, but without loop-trip count profiles, we achieve a speedup of only 7.7%. Finally, our results show compiler-based pre-execution can benefit multiprogrammed workloads. Simultaneously executing applications achieve higher throughput with pre-execution compared to no pre-execution. Due to contention for hardware contexts, however, time-slicing outperforms simultaneous execution in some cases where individual applications make heavy use of pre-execution threads.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Santosh G. Abraham , Rabin A. Sugumar , Daniel Windheiser , B. R. Rau , Rajiv Gupta, Predictability of load/store instruction latencies, Proceedings of the 26th annual international symposium on Microarchitecture, p.139-152, December 01-03, 1993, Austin, Texas, United States
|
| |
2
|
Anderson, J. M., Berc, L. M., Dean, J., Ghemawat, S., Henzinger, M. R., Leung, S.-T. A., Sites, R. L., Vandevoorde, M. T., Waldspurger, C. A., and Weihl, W. E. 1997. Continuous profiling: Where have all the cycles gone? SRC Technical Note 1997-016a, Digital. July.
|
 |
3
|
|
| |
4
|
Binkley, D. and Gallagher, K. 1996. A Survey of Program Slicing. Academic Press, Orlando, Fla.
|
| |
5
|
Burger, D. and Austin, T. M. 1997. The SimpleScalar Tool Set, Version 2.0. CS TR 1342, University of Wisconsin-Madison, Madison, Wisc., June.
|
| |
6
|
|
 |
7
|
Robert S. Chappell , Jared Stark , Sangwook P. Kim , Steven K. Reinhardt , Yale N. Patt, Simultaneous subordinate microthreading (SSMT), Proceedings of the 26th annual international symposium on Computer architecture, p.186-195, May 01-04, 1999, Atlanta, Georgia, United States
|
 |
8
|
Robert S. Chappell , Francis Tseng , Adi Yoaz , Yale N. Patt, Difficult-path branch prediction using subordinate microthreads, Proceedings of the 29th annual international symposium on Computer architecture, p.307, May 25-29, 2002, Anchorage, Alaska
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
Jamison D. Collins , Hong Wang , Dean M. Tullsen , Christopher Hughes , Yong-Fong Lee , Dan Lavery , John P. Shen, Speculative precomputation: long-range prefetching of delinquent loads, Proceedings of the 28th annual international symposium on Computer architecture, p.14-25, June 30-July 04, 2001, Göteborg, Sweden
|
| |
13
|
Cytron, R. 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing. (University Park, PA). IEEE Computer Society Press, Los Alamitos, Calif., 836--844.
|
| |
14
|
Dubois, M. and Song, Y. H. 1998. Assisted execution. CENG Technical Report 98-25, Department of EE-Systems, University of Southern California. October.
|
 |
15
|
|
| |
16
|
Alexandre Farcy , Olivier Temam , Roger Espasa , Toni Juan, Dataflow analysis of branch mispredictions and its application to early resolution of branch outcomes, Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture, p.59-68, November 1998, Dallas, Texas, United States
|
 |
17
|
|
 |
18
|
|
 |
19
|
Steve S.W. Liao , Perry H. Wang , Hong Wang , Gerolf Hoflehner , Daniel Lavery , John P. Shen, Post-pass binary adaptation for software-based speculative precomputation, Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, June 17-19, 2002, Berlin, Germany
|
 |
20
|
|
| |
21
|
Lyle, J. R. and Wallace, D. R. May 1997. Using the unravel program slicing tool to evaluate high integrity software. In Proceedings of 10th International Software Quality Week (San Francisco, Calif.).
|
| |
22
|
Lyle, J. R., Wallace, D. R., Graham, J. R., Gallagher, K. B., Poole, J. P., and Binkley, D. W. 1995. Unravel: A CASE tool to assist evaluation of high integrity software. NISTIR 5691, National Institute of Standards and Technology. August.
|
| |
23
|
|
 |
24
|
|
 |
25
|
|
| |
26
|
Padua, D. A., Kuck, D. J., and Lawrie, D. H. 1980. High-speed multiprocessors and compilation techniques. IEEE Trans. Comput. C-29, 9 (Sept.), 763--776.
|
 |
27
|
|
| |
28
|
|
 |
29
|
|
 |
30
|
Amir Roth , Andreas Moshovos , Gurindar S. Sohi, Dependence based prefetching for linked data structures, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.115-126, October 02-07, 1998, San Jose, California, United States
|
 |
31
|
|
| |
32
|
|
| |
33
|
|
 |
34
|
|
| |
35
|
SPEC. 2000. SPEC CPU2000 V1.2 (http://www.specbench.org/osg/cpu2000/).
|
 |
36
|
|
 |
37
|
Dean M. Tullsen , Susan J. Eggers , Joel S. Emer , Henry M. Levy , Jack L. Lo , Rebecca L. Stamm, Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor, Proceedings of the 23rd annual international symposium on Computer architecture, p.191-202, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
| |
38
|
|
| |
39
|
|
| |
40
|
Weiser, M. 1984. Program slicing. IEEE Trans. Softw. Eng. SE-10, 4 (July).
|
 |
41
|
|
 |
42
|
|
| |
43
|
|
CITED BY 3
|
|
Tanping Wang , Filip Blagojevic , Dimitrios S. Nikolopoulos, Runtime support for integrating precomputation and thread-level parallelism on simultaneous multithreaded processors, Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems, p.1-12, October 22-23, 2004, Houston, Texas
|
|
|
Guilherme Ottoni , Ram Rangan , Adam Stoler , David I. August, Automatic Thread Extraction with Decoupled Software Pipelining, Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, p.105-118, November 12-16, 2005, Barcelona, Spain
|
|
|
Yong Chen , Surendra Byna , Xian-He Sun , Rajeev Thakur , William Gropp, Hiding I/O latency with pre-execution prefetching for parallel applications, Proceedings of the 2008 ACM/IEEE conference on Supercomputing, November 15-21, 2008, Austin, Texas
|
INDEX TERMS
Primary Classification:
B.
Hardware
B.8
Performance and Reliability
B.8.2
Performance Analysis and Design Aids
Additional Classification:
C.
Computer Systems Organization
C.0
GENERAL
Subjects:
System architectures;
Modeling of computer architecture
C.4
PERFORMANCE OF SYSTEMS
Subjects:
Design studies
D.
Software
D.3
PROGRAMMING LANGUAGES
D.3.4
Processors
Subjects:
Compilers
General Terms:
Algorithms,
Design,
Experimentation,
Measurement,
Performance
Keywords:
Data prefetching,
memory-level parallelism,
multithreading,
pre-execution,
prefetch conversion,
program slicing,
speculative loop parallelization
|