|
ABSTRACT
We have built a runtime compilation system that takes unmodified sequential binaries and improves their performance on off-the-shelf multiprocessors using dynamic vectorization and loop-level parallelization techniques. Our system, Azure, is purely software based and requires no specific hardware support for speculative thread execution, yet it is able to break even in most cases; that is, the achieved speedup exceeds the cost of runtime monitoring and compilation, often by significant amounts. Key to this remarkable performance is an offline preprocessing step that extracts a mostly correct control flow graph (CFG) from the binary program ahead of time. This statically obtained CFG is incomplete in that it may be missing some edges corresponding to computed branches. We describe how such additional control flow edges are discovered and handled at runtime, so that an incomplete static analysis never leads to an incorrect optimization result. The availability of a mostly correct CFG enables us to statically partition a binary executable into single-entry multiple-exit regions and to identify potential parallelization candidates ahead of execution. Program regions that are not candidates for parallelization can thereby be excluded completely from runtime monitoring and dynamic recompilation. Azure's extremely low overhead is a direct consequence of this design.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
Vasanth Bala , Evelyn Duesterwald , Sanjeev Banerjia, Dynamo: a transparent dynamic optimization system, Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation, p.1-12, June 18-21, 2000, Vancouver, British Columbia, Canada
|
| |
3
|
Balakrishnan, G. and Reps, T. 2004. Analyzing memory accesses in x86 executables. In Proceedings of the Conference on Compiler Construction. Lecture Notes in Computer Science, vol. 2985, Springer Verlag, 5--23.
|
| |
4
|
Leonid Baraz , Tevi Devor , Orna Etzion , Shalom Goldenberg , Alex Skaletsky , Yun Wang , Yigel Zemach, IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium®-based systems, Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, p.191, December 03-05, 2003
|
| |
5
|
|
| |
6
|
|
| |
7
|
|
 |
8
|
|
| |
9
|
Anton Chernoff , Mark Herdeg , Ray Hookway , Chris Reeve , Norman Rubin , Tony Tye , S. Bharadwaj Yadavalli , John Yates, FX!32: A Profile-Directed Binary Translator, IEEE Micro, v.18 n.2, p.56-64, March 1998
[doi> 10.1109/40.671403]
|
| |
10
|
|
 |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
|
| |
15
|
|
 |
16
|
Brian Grant , Matthai Philipose , Markus Mock , Craig Chambers , Susan J. Eggers, An evaluation of staged run-time optimizations in DyC, Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, p.293-304, May 01-04, 1999, Atlanta, Georgia, United States
|
| |
17
|
Lance Hammond , Benedict A. Hubbert , Michael Siu , Manohar K. Prabhu , Michael Chen , Kunle Olukotun, The Stanford Hydra CMP, IEEE Micro, v.20 n.2, p.71-84, March 2000
[doi> 10.1109/40.848474]
|
| |
18
|
Wen-Mei W. Hwu , Scott A. Mahlke , William Y. Chen , Pohua P. Chang , Nancy J. Warter , Roger A. Bringmann , Roland G. Ouellette , Richard E. Hank , Tokuzo Kiyohara , Grant E. Haab , John G. Holm , Daniel M. Lavery, The superblock: an effective technique for VLIW and superscalar compilation, The Journal of Supercomputing, v.7 n.1-2, p.229-248, May 1993
[doi> 10.1007/BF01205185]
|
| |
19
|
Kagan, M., Gochman, S., Orenstien, D., and Lin, D. 1997. MMX micro-architecture of Pentium processors with MMX technology and Pentium II micro-processors. Intel Techn. J. 8.
|
| |
20
|
|
 |
21
|
|
| |
22
|
Klaiber, A. 2000. The technology behind Crusoe processors. White Paper, Transmeta Corp.
|
 |
23
|
|
| |
24
|
|
| |
25
|
Krishnan, V. S. 1998. Speculative multithreading architectures. Tech. rep. UIUCDCS-R-98-2048, UIUC.
|
| |
26
|
|
 |
27
|
|
| |
28
|
|
 |
29
|
|
| |
30
|
|
 |
31
|
Kunle Olukotun , Basem A. Nayfeh , Lance Hammond , Ken Wilson , Kunyung Chang, The case for a single-chip multiprocessor, Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, p.2-11, October 01-04, 1996, Cambridge, Massachusetts, United States
|
 |
32
|
Carlos García Quiñones , Carlos Madriles , Jesús Sánchez , Pedro Marcuello , Antonio González , Dean M. Tullsen, Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices, Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, June 12-15, 2005, Chicago, IL, USA
|
| |
33
|
|
| |
34
|
Eric Rotenberg , Quinn Jacobson , Yiannakis Sazeides , Jim Smith, Trace processors, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.138-148, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
| |
35
|
Skoglund, J. and Felsberg, M. 2005. Fast image processing using SSE2. In Proceedings of the SSBA Symposium on Image Analysis.
|
 |
36
|
|
 |
37
|
|
| |
38
|
Thakkar, S. T. and Huff, T. 1999. The Internet Streaming SIMD Extensions. Intel Tech. J., 8.
|
| |
39
|
|
| |
40
|
|
 |
41
|
|
| |
42
|
|
 |
43
|
|
| |
44
|
|
 |
45
|
|
| |
46
|
|
|