ABSTRACT
The recent shift in the industry towards chip multiprocessor (CMP) designs has brought the need for multi-threaded applications to mainstream computing. As observed in several limit studies, most of the parallelization opportunities require looking for parallelism beyond local regions of code. To exploit these opportunities, especially for sequential applications, researchers have recently proposed global multi-threaded instruction scheduling techniques, including DSWP and GREMIO. These techniques simultaneously schedule instructions from large regions of code, such as arbitrary loop nests or whole procedures, and have been shown to be effective at extracting threads for many applications. A key enabler of these global instruction scheduling techniques is the Multi-Threaded Code Generation (MTCG) algorithm proposed in [16], which generates multi-threaded code for any partition of the instructions into threads. This algorithm inserts communication and synchronization instructions in order to satisfy all inter-thread dependences. In this paper, we present a general compiler framework, COCO, to optimize the communication and synchronization instructions inserted by the MTCG algorithm. This framework, based on thread-aware data-flow analyses and graph min-cut algorithms, appropriately models and optimizes all kinds of inter-thread dependences, including register, memory, and control dependences. Our experiments, using a fully automatic compiler implementation of these techniques, demonstrate significant reductions (about 30% on average) in the number of dynamic communication instructions in code parallelized with DSWP and GREMIO. This reduction in communication translates to performance gains of up to 40%.
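The min-cut idea mentioned in the abstract can be illustrated with a small sketch. This is not the COCO algorithm itself: it only shows how a minimum cut over a profile-weighted control-flow graph picks the cheapest set of edges on which to place communication for one value. The graph, the block names (`def`, `latch`, `exit`, `use`), the edge weights, and the function `min_cut` are all hypothetical, chosen for illustration.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """Return (cut_value, cut_edges) of a directed graph via Edmonds-Karp.

    capacity maps (u, v) -> non-negative weight. Here a weight stands for
    an assumed profile count: how often that control-flow edge executes.
    """
    residual = {}
    adj = {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge for the residual graph
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def bfs_parents():
        # Breadth-first search over edges with remaining residual capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parents and residual[(u, v)] > 0:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[(parents[v], v)])
            v = parents[v]
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck

    # Nodes still reachable from the source form the source side of the cut.
    reachable = set(parents)
    cut_edges = sorted(
        (u, v) for (u, v) in capacity if u in reachable and v not in reachable
    )
    return flow, cut_edges


# Hypothetical CFG: a value is defined inside a hot loop and used by
# another thread only after the loop exits.
cfg = {
    ("def", "latch"): 100,  # loop body runs 100 times
    ("latch", "def"): 99,   # back edge
    ("latch", "exit"): 1,   # loop exit
    ("exit", "use"): 1,
}
cost, edges = min_cut(cfg, "def", "use")
print(cost, edges)
```

In this toy graph, communicating right after the definition would execute once per iteration, while the cut places the communication on the loop-exit edge, which executes once.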
REFERENCES
[2] Andrea Capitanio, Nikil Dutt, and Alexandru Nicolau. Partitioned register files for VLIWs: a preliminary analysis of tradeoffs. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 292-300, December 01-04, 1992, Portland, Oregon, United States.
[3] Soumen Chakrabarti, Manish Gupta, and Jong-Deok Choi. Global communication analysis and optimization. In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, pp. 68-78, May 21-24, 1996, Philadelphia, Pennsylvania, United States.
[6] L. R. Ford, Jr. and D. R. Fulkerson. Flows in Networks. Princeton University Press, 1962.
[9] Jens Knoop, Oliver Rüthing, and Bernhard Steffen. Lazy code motion. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, pp. 224-234, June 15-19, 1992, San Francisco, California, United States.
[11] Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. Space-time scheduling of instruction-level parallelism on a raw machine. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 46-57, October 02-07, 1998, San Jose, California, United States.
[14] E. M. Nystrom, H.-S. Kim, and W.-M. Hwu. Bottom-up and top-down context-sensitive summary-based pointer analysis. In Proceedings of the 11th Static Analysis Symposium, August 2004.
[16] Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. Automatic Thread Extraction with Decoupled Software Pipelining. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 105-118, November 12-16, 2005, Barcelona, Spain. doi:10.1109/MICRO.2005.13
[17] D. A. Penry, M. Vachharajani, and D. I. August. Rapid development of a flexible validated processor model. In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation, June 2005.
[18] Ram Rangan, Neil Vachharajani, Adam Stoler, Guilherme Ottoni, David I. August, and George Z. N. Cai. Support for High-Frequency Streaming in CMPs. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 259-272, December 09-13, 2006. doi:10.1109/MICRO.2006.47
[22] John W. Sias, Sain-zee Ueng, Geoff A. Kent, Ian M. Steiner, Erik M. Nystrom, and Wen-mei W. Hwu. Field-testing IMPACT EPIC research results in Itanium 2. In Proceedings of the 31st Annual International Symposium on Computer Architecture, p. 26, June 19-23, 2004, München, Germany.
[25] Spyridon Triantafyllis, Matthew J. Bridges, Easwaran Raman, Guilherme Ottoni, and David I. August. A framework for unrestricted whole-program optimization. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 11-14, 2006, Ottawa, Ontario, Canada.
[26] N. Vachharajani, M. Iyer, C. Ashok, M. Vachharajani, D. I. August, and D. A. Connors. Chip multi-processor scalability for single-threaded applications. In Proceedings of the Workshop on Design, Architecture, and Simulation of Chip Multi-Processors, November 2005.
[27] Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August. Speculative Decoupled Software Pipelining. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pp. 49-59, September 15-19, 2007. doi:10.1109/PACT.2007.66
REVIEW
Reviewer: Olivier Louis Marie Lecarme
This interesting paper is another example of the extraordinary complications that compiler writers must endure if they want to take advantage, at least partly, of the theoretical capabilities of new multiprocessors. Since chip building is approach