ACM Home Page
Please provide us with feedback. Feedback
Speculative thread decomposition through empirical optimization
Full text PdfPdf (512 KB)
Source
Principles and Practice of Parallel Programming archive
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming table of contents
San Jose, California, USA
SESSION: Thread-level speculation table of contents
Pages: 205 - 214  
Year of Publication: 2007
ISBN:978-1-59593-602-8
Authors
Troy A. Johnson  Purdue University, West Lafayette, IN
Rudolf Eigenmann  Purdue University, West Lafayette, IN
T. N. Vijaykumar  Purdue University, West Lafayette, IN
Sponsors
ACM: Association for Computing Machinery
SIGPLAN: ACM Special Interest Group on Programming Languages
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 75,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1229428.1229474
What is a DOI?

ABSTRACT

Chip multiprocessors (CMPs), or multi-core processors, have become a common way of reducing chip complexity and power consumption while maintaining high performance. Speculative CMPs use hardware to enforce dependence, allowing a parallelizing compiler to generate multithreaded code without needing to prove independence. In these systems, a sequential program is decomposed into threads to be executed in parallel; dependent threads cause performance degradation, but do not affect correctness. Thread decomposition attempts to reduce the run-time overheads of data dependence, thread misprediction, and load imbalance. Because these overheads depend on the runtimes of the threads that are being created by the decomposition, reducing the overheads while creating the threads is a circular problem. Static compile-time decomposition handles this problem by estimating the run times of the candidate threads, but is limited by the estimates' inaccuracy. Dynamic execution-time decomposition in hardware has better run-time information, but is limited by the decomposition hardware's complexity and run-time overhead. We propose a third approach where a compiler instruments a profile run of the application to search through candidate threads and pick the best threads as the profile run executes. The resultant decomposition is compiled into the application so that a production run of the application has no instrumentation and does not incurany decomposition overhead. We avoid static decomposition's estimation accuracy problem by using actual profile-run execution times to pick threads, and we avoid dynamic decomposition's overhead by performing the decomposition at profile time. Because we allow candidate threads to span arbitrary sections of the application's call graph and loop nests, an exhaustive search of the decomposition space is prohibitive, even in profile runs. To address this issue, we make the key observation that the run-time overhead of a thread depends, to the first order, only on threads that overlap with the thread inexecution (e.g., in a four-core CMP, a given thread can overlap with at most three preceding and three following threads). This observation implies that a given thread affects only a few other threads, allowing pruning of the space. Using a CMP simulator, we achieve an average speedup of 3.51 on four cores for five of the SPEC CFP2000 benchmarks, which compares favorably to recent static techniques. We also discuss experiments with CINT2000.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
 
3
M. Byler et al. Multiple Version Loops. In Proceedings of the International Conference on Parallel Processing, pages 312--318, August 1987.
4
 
5
L. Codrescu and D. S. Wills. On Dynamic Speculative Thread Partitioning and the MEM-Slicing Algorithm. Journal of Universal Computer Science, 6(10):908--914, 2000.
 
6
K. Cooper, M. Hall, and K. Kennedy. A Methodology for Procedure Cloning. Computer Languages, 19(2):105--117, April 1993.
7
8
 
9
S. I. Feldman, D. M. Gay, M. W. Maimone, and N. L. Schryer. A Fortran to C Converter. Technical report, AT&T Bell Laboratories, March 1995.
 
10
 
11
12
 
13
 
14
15
16
 
17
T. A. Johnson et al. Experiences in Using Cetus for Source-to-Source Transformations. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 1--14, September 2004.
18
 
19
20
21
22
23
24
 
25
26
27
 
28
29
30
 
31
32
 
33
 
34
 
35
 
36
37
 
38
 
39


Collaborative Colleagues:
Troy A. Johnson: colleagues
Rudolf Eigenmann: colleagues
T. N. Vijaykumar: colleagues