Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor
Source
International Symposium on Computer Architecture
Proceedings of the 25th annual international symposium on Computer architecture
Barcelona, Spain
Pages: 306 - 317
Year of Publication: 1998
ISBN: 0-8186-8491-7
Authors
Stephen W. Keckler, Computer Systems Laboratory, Stanford University, Gates CS Building, Stanford, CA
William J. Dally, Computer Systems Laboratory, Stanford University, Gates CS Building, Stanford, CA
Daniel Maskit, Computer Systems Laboratory, Stanford University, Gates CS Building, Stanford, CA
Nicholas P. Carter, Computer Systems Laboratory, Stanford University, Gates CS Building, Stanford, CA
Andrew Chang, Computer Systems Laboratory, Stanford University, Gates CS Building, Stanford, CA
Whay S. Lee, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA
Publisher
IEEE Computer Society, Washington, DC, USA
Bibliometrics
Downloads (6 Weeks): 23, Downloads (12 Months): 49, Citation Count: 13
ABSTRACT
Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-size of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As this fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications.
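To see why synchronization cost matters at a grain size of about 20 cycles, the following is a minimal sketch of a toy speedup model. The model, the function name `fine_grain_speedup`, and the absolute cycle costs are assumptions made for illustration; only the 20-cycle run length, the three processors, and the 60x ratio between register-based and cache-based synchronization come from the abstract. It is not the paper's evaluation method and does not reproduce its measured speedups.

```python
def fine_grain_speedup(task_cycles, n_threads, sync_cycles):
    """Speedup of n_threads equal tasks run in parallel vs. back to back.

    Toy model (an assumption, not from the paper): each thread runs one
    task of task_cycles, then all threads pay sync_cycles once to
    synchronize before continuing.
    """
    serial = task_cycles * n_threads        # tasks executed one after another
    parallel = task_cycles + sync_cycles    # tasks overlap, then synchronize
    return serial / parallel


# Hypothetical absolute costs; only the 60x ratio is taken from the abstract.
REGISTER_SYNC = 1                  # assumed cycles for a register-based barrier
CACHE_SYNC = 60 * REGISTER_SYNC    # cache-based synchronization, 60x slower

for label, cost in [("register", REGISTER_SYNC), ("cache", CACHE_SYNC)]:
    print(label, round(fine_grain_speedup(20, 3, cost), 2))
```

Under these assumed costs, three 20-cycle tasks still run faster in parallel with the cheap barrier, while the 60x more expensive synchronization makes the parallel version slower than serial execution. That is the qualitative point of the abstract: very short tasks only pay off when synchronization is nearly free.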
CITED BY 14
Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, Kunle Olukotun, The Stanford Hydra CMP, IEEE Micro, v.20 n.2, p.71-84, March 2000
Richard B. Kujoth, Chi-Wei Wang, Jeffrey J. Cook, Derek B. Gottlieb, Nicholas P. Carter, A wire delay-tolerant reconfigurable unit for a clustered programmable-reconfigurable processor, Microprocessors & Microsystems, v.31 n.2, p.146-159, March 2007
Yao Guo, Vladimir Vlassov, Raksit Ashok, Richard Weiss, Csaba Andras Moritz, Synchronization coherence: A transparent hardware mechanism for cache coherence and fine-grained synchronization, Journal of Parallel and Distributed Computing, v.68 n.2, p.165-181, February 2008
Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson, Mark Oskin, Susan J. Eggers, The WaveScalar architecture, ACM Transactions on Computer Systems (TOCS), v.25 n.2, p.4-es, May 2007
Jack Sampson, Ruben Gonzalez, Jean-Francois Collard, Norman P. Jouppi, Mike Schlansker, Brad Calder, Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers, Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, p.235-246, December 09-13, 2006
Paul Gratz, Karthikeyan Sankaralingam, Heather Hanson, Premkishore Shivakumar, Robert McDonald, Stephen W. Keckler, Doug Burger, Implementation and Evaluation of a Dynamically Routed Processor Operand Network, Proceedings of the First International Symposium on Networks-on-Chip, p.7-17, May 07-09, 2007