|
ABSTRACT
Chip multiprocessors (CMPs), or multi-core processors, have become a common way of reducing chip complexity and power consumption while maintaining high performance. Speculative CMPs use hardware to enforce dependence, allowing a parallelizing compiler to generate multithreaded code without needing to prove independence. In these systems, a sequential program is decomposed into threads to be executed in parallel; dependent threads cause performance degradation, but do not affect correctness. Thread decomposition attempts to reduce the run-time overheads of data dependence, thread misprediction, and load imbalance. Because these overheads depend on the runtimes of the threads that are being created by the decomposition, reducing the overheads while creating the threads is a circular problem. Static compile-time decomposition handles this problem by estimating the run times of the candidate threads, but is limited by the estimates' inaccuracy. Dynamic execution-time decomposition in hardware has better run-time information, but is limited by the decomposition hardware's complexity and run-time overhead. We propose a third approach where a compiler instruments a profile run of the application to search through candidate threads and pick the best threads as the profile run executes. The resultant decomposition is compiled into the application so that a production run of the application has no instrumentation and does not incurany decomposition overhead. We avoid static decomposition's estimation accuracy problem by using actual profile-run execution times to pick threads, and we avoid dynamic decomposition's overhead by performing the decomposition at profile time. Because we allow candidate threads to span arbitrary sections of the application's call graph and loop nests, an exhaustive search of the decomposition space is prohibitive, even in profile runs. To address this issue, we make the key observation that the run-time overhead of a thread depends, to the first order, only on threads that overlap with the thread inexecution (e.g., in a four-core CMP, a given thread can overlap with at most three preceding and three following threads). This observation implies that a given thread affects only a few other threads, allowing pruning of the space. Using a CMP simulator, we achieve an average speedup of 3.51 on four cores for five of the SPEC CFP2000 benchmarks, which compares favorably to recent static techniques. We also discuss experiments with CINT2000.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
Scott E. Breach , T. N. Vijaykumar , Gurindar S. Sohi, The anatomy of the register file in a multiscalar processor, Proceedings of the 27th annual international symposium on Microarchitecture, p.181-190, November 30-December 02, 1994, San Jose, California, United States
[doi> 10.1145/192724.192750]
|
| |
3
|
M. Byler et al. Multiple Version Loops. In Proceedings of the International Conference on Parallel Processing, pages 312--318, August 1987.
|
 |
4
|
|
| |
5
|
L. Codrescu and D. S. Wills. On Dynamic Speculative Thread Partitioning and the MEM-Slicing Algorithm. Journal of Universal Computer Science, 6(10):908--914, 2000.
|
| |
6
|
K. Cooper, M. Hall, and K. Kennedy. A Methodology for Procedure Cloning. Computer Languages, 19(2):105--117, April 1993.
|
 |
7
|
|
 |
8
|
Zhao-Hui Du , Chu-Cheow Lim , Xiao-Feng Li , Chen Yang , Qingyu Zhao , Tin-Fook Ngai, A cost-driven compilation framework for speculative parallelization of sequential programs, Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, June 09-11, 2004, Washington DC, USA
|
| |
9
|
S. I. Feldman, D. M. Gay, M. W. Maimone, and N. L. Schryer. A Fortran to C Converter. Technical report, AT&T Bell Laboratories, March 1995.
|
| |
10
|
|
| |
11
|
María Jesús Garzarán , Milos Prvulovic , José María Llabería , Víctor Viñals , Lawrence Rauchwerger , Josep Torrellas, Tradeoffs in Buffering Memory State for Thread-Level Speculation in Multiprocessors, Proceedings of the 9th International Symposium on High-Performance Computer Architecture, p.191, February 08-12, 2003
|
 |
12
|
|
| |
13
|
|
| |
14
|
|
 |
15
|
Lance Hammond , Mark Willey , Kunle Olukotun, Data speculation support for a chip multiprocessor, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.58-69, October 02-07, 1998, San Jose, California, United States
|
 |
16
|
|
| |
17
|
T. A. Johnson et al. Experiences in Using Cetus for Source-to-Source Transformations. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 1--14, September 2004.
|
 |
18
|
Seon Wook Kim , Chong-liang Ooi , Rudolf Eigenmann , Babak Falsafi , T. N. Vijaykumar, Reference idempotency analysis: a framework for optimizing speculative execution, Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming, p.2-11, June 2001, Snowbird, Utah, United States
|
| |
19
|
|
 |
20
|
|
 |
21
|
Wei Liu , James Tuck , Luis Ceze , Wonsun Ahn , Karin Strauss , Jose Renau , Josep Torrellas, POSH: a TLS compiler that exploits program structure, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, March 29-31, 2006, New York, New York, USA
[doi> 10.1145/1122971.1122997]
|
 |
22
|
|
 |
23
|
Andreas Moshovos , Scott E. Breach , T. N. Vijaykumar , Gurindar S. Sohi, Dynamic speculation and synchronization of data dependences, Proceedings of the 24th annual international symposium on Computer architecture, p.181-193, June 01-04, 1997, Denver, Colorado, United States
|
 |
24
|
|
| |
25
|
|
 |
26
|
|
 |
27
|
|
| |
28
|
Eric Rotenberg , Quinn Jacobson , Yiannakis Sazeides , Jim Smith, Trace processors, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.138-148, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
 |
29
|
|
 |
30
|
|
| |
31
|
|
 |
32
|
J. Greggory Steffan , Christopher B. Colohan , Antonia Zhai , Todd C. Mowry, A scalable approach to thread-level speculation, Proceedings of the 27th annual international symposium on Computer architecture, p.1-12, June 2000, Vancouver, British Columbia, Canada
|
| |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
 |
37
|
|
| |
38
|
Elliot Waingold , Michael Taylor , Devabhaktuni Srikrishna , Vivek Sarkar , Walter Lee , Victor Lee , Jang Kim , Matthew Frank , Peter Finch , Rajeev Barua , Jonathan Babb , Saman Amarasinghe , Anant Agarwal, Baring It All to Software: Raw Machines, Computer, v.30 n.9, p.86-93, September 1997
[doi> 10.1109/2.612254]
|
| |
39
|
|
|