|
ABSTRACT
Multi-cores are becoming ubiquitous as exemplified by Sun's Niagra-2, Intel's Nehalem and AMD's Sau Paulo octal cores. The number of cores per chip is expected to rise in foreseeable future, as evidenced by the recently announced Intel's 80-core Teraflops Research Chip. Exploiting the parallelism of multicores necessitates concurrent software. One way to parallelize programs, not amenable to auto-parallelization, is via explicit synchronization. The placement of the synchronization primitives has a large bearing on how much thread-level parallelism (TLP) can be achieved. In this paper, we propose novel predication-based and other adjunct synchronization optimizations which facilitate exploitation on higher level of TLP than what can be achieved using the state-of-the-art. We demonstrate the efficacy of our techniques, on a real machine, using real codes, specifically, from the industry-standard SPEC CPU benchmarks and other widely used open source codes such as PostgreSQL. Our results show that the proposed techniques yield significantly higher levels of TLP than the state-of-the-art.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
S. F. Lundstrom and G. H. Barnes. A controllable MIMD architectures. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19--27, St. Charles, IL, August 1980.
|
| |
2
|
|
 |
3
|
|
| |
4
|
OpenMP Specification, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.
|
| |
5
|
SPEC CPU Benchmarks. http://www.spec.org/benchmarks.html.
|
| |
6
|
PostgreSQL. http://www.postgresql.org/.
|
 |
7
|
Arun Kejariwal , Xinmin Tian , Wei Li , Milind Girkar , Sergey Kozhukhov , Hideki Saito , Utpal Banerjee , Alexandru Nicolau , Alexander V. Veidenbaum , Constantine D. Polychronopoulos, On the performance potential of different types of speculative thread-level parallelism: The DL version of this paper includes corrections that were not made available in the printed proceedings, Proceedings of the 20th annual international conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia
[doi> 10.1145/1183401.1183407]
|
| |
8
|
SPEC CINT2006. http://www.spec.org/cpu2006/CINT2006.
|
 |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
L. Shen, Z. Wang, and J. Lu. Predicate analysis based on path information. In Advanced Parallel Processing Technologies, LNCS, 2834, pages 147--151, 2003.
|
 |
17
|
J. R. Allen , Ken Kennedy , Carrie Porterfield , Joe Warren, Conversion of control dependence to data dependence, Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, p.177-189, January 24-26, 1983, Austin, Texas
[doi> 10.1145/567067.567085]
|
| |
18
|
J. Park and M. Schlansker. On predicated execution. Technical Report 58--91, Hewlett Packard Laboratories, 1991.
|
 |
19
|
|
| |
20
|
David M. Gillies , Dz-ching Roy Ju , Richard Johnson , Michael Schlansker, Global predicate analysis and its application to register allocation, Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture, p.114-125, December 02-04, 1996, Paris, France
|
| |
21
|
R. Fitzgerald, T. B. Knoblock, E. Ruf, B. Steensgaard, and D. Tarditi. Marmot: an optimizing compiler for Java. Technical report, Microsoft, November 1998.
|
| |
22
|
|
 |
23
|
|
 |
24
|
|
 |
25
|
|
 |
26
|
|
 |
27
|
Weirong Zhu , Vugranam C Sreedhar , Ziang Hu , Guang R. Gao, Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures, Proceedings of the 34th annual international symposium on Computer architecture, June 09-13, 2007, San Diego, California, USA
|
 |
28
|
Arun Kejariwal , Hideki Saito , Xinmin Tian , Milind Girkar , Wel Li , Utpal Banerjee , Alexandru Nicolau , Constantine D. Polychronopoulos, Lightweight lock-free synchronization methods for multithreading, Proceedings of the 20th annual international conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia
[doi> 10.1145/1183401.1183452]
|
 |
29
|
|
| |
30
|
W. M. Pottenger. Induction variable substitution and reduction recognition in the Polaris parallelizing compiler. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1995.
|
 |
31
|
|
 |
32
|
|
 |
33
|
|
| |
34
|
|
| |
35
|
A. V. Aho and J. D. Ullman. The theory of parsing, translation and compiling. In Compiling. Prentice-Hall, Englewood Cliffs, NJ, 1973.
|
 |
36
|
D. J. Kuck , R. H. Kuhn , D. A. Padua , B. Leasure , M. Wolfe, Dependence graphs and compiler optimizations, Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, p.207-218, January 26-28, 1981, Williamsburg, Virginia
[doi> 10.1145/567532.567555]
|
| |
37
|
R. Cytron and J. Ferrante. What's in a name? or the value of renaming for parallelism detection and storage allocation. In Proceedings of the 1987 International Conference on Parallel Processing, St. Charles, IL, August 1987.
|
| |
38
|
S. Weatherford. High level pattern matching extension to C++ for fortran program manipulation in Polaris. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, May 1994.
|
| |
39
|
Wine. http://sourceforge.net/project/showfiles.php?group id=6241.
|
| |
40
|
SPEC CPU2006. http://www.spec.org/cpu2006.
|
| |
41
|
MySQL: The world's most popular open source database. http://www.mysql.com/.
|
| |
42
|
A. Kejariwal and A. Nicolau. Reading list of mutual exclusion, locking, synchronization and concurrent objects. http://www.ics.uci.edu/-akejariw/ConcurrentExecutionReadingList.pdf.
|
| |
43
|
R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, pages 836--844, St. Charles, IL, August 1986.
|
| |
44
|
S. Midkiff and D. Padua. Compiler generated synchronization for DO loops. In Proceedings of the 1986 International Conference on Parallel Processing, pages 544--551, St. Charles, IL, August 1986.
|
| |
45
|
H. Kasahara, H. Honda, M. Iwata, and M. Hirota. A compilation scheme for macro-dataow computation on hierarchical multiprocessor systems. In Proceedings of the International Conference on Parallel Processing, pages II294--II295, Urbana-Champaign, IL, August 1990.
|
| |
46
|
|
 |
47
|
R. Cytron , M. Hind , W. Hsieh, Automatic generation of DAG parallelism, Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation, p.54-68, June 19-23, 1989, Portland, Oregon, United States
|
 |
48
|
|
| |
49
|
W. N. Scherer III and M. L. Scott. Nonblocking concurrent objects with condition synchronization. In Proceedings of the 18th International Symposium on Distributed Computing, Amsterdam, The Netherlands, Oct 2004.
|
| |
50
|
F. Fich, D. Hendler, and N. Shavit. On the inherent weakness of conditional synchronization primitives. http://www.cs.tau.ac.il/-afek/Handler-conditionals.pdf.
|
 |
51
|
|
 |
52
|
|
| |
53
|
|
 |
54
|
|
 |
55
|
|
| |
56
|
S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Resynchronization of multiprocessor schedules Part I: Fundamental concepts and unbounded latency analysis. Memorandum UCB/ERL M96/55, Electronics Research Laboratory, U. C. Berkeley, October 1996.
|
| |
57
|
|
| |
58
|
|
| |
59
|
|
| |
60
|
|
 |
61
|
Glenn Reinman , Brad Calder , Dean Tullsen , Gary Tyson , Todd Austin, Classifying load and store instructions for memory renaming, Proceedings of the 13th international conference on Supercomputing, p.399-407, June 20-25, 1999, Rhodes, Greece
[doi> 10.1145/305138.305218]
|
| |
62
|
K. Knobe and V. Sarkar. Array SSA form and its use in parallelization. pages 107--120, San Diego, CA, 1998.
|
| |
63
|
|
 |
64
|
Silvius Rus , Guobin He , Christophe Alias , Lawrence Rauchwerger, Region array SSA, Proceedings of the 15th international conference on Parallel architectures and compilation techniques, September 16-20, 2006, Seattle, Washington, USA
[doi> 10.1145/1152154.1152165]
|
|