|
ABSTRACT
Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function. To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication. This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7--14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from Spec2000 and speed it up by 9x compared to the serial version.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
Vikas Agarwal , M. S. Hrishikesh , Stephen W. Keckler , Doug Burger, Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th annual international symposium on Computer architecture, p.248-259, June 2000, Vancouver, British Columbia, Canada
|
| |
3
|
Arvind. 2005. Dataflow: Passing the token. ISCA keynote in Annual International Symposium on Computer Architecture.
|
 |
4
|
|
 |
5
|
Luiz André Barroso , Kourosh Gharachorloo , Robert McNamara , Andreas Nowatzyk , Shaz Qadeer , Barton Sano , Scott Smith , Robert Stets , Ben Verghese, Piranha: a scalable architecture based on single-chip multiprocessing, Proceedings of the 27th annual international symposium on Computer architecture, p.282-293, June 2000, Vancouver, British Columbia, Canada
|
| |
6
|
Barth, P. S., Nikhil, R. S., and Arvind. 1991. M-structures: Extending a parallel, non-strict, functional languages with state. Tech. Rep. MIT/LCS/TR-327, Massachusetts Institute of Technology. March.
|
| |
7
|
|
 |
8
|
Mihai Budiu , Girish Venkataramani , Tiberiu Chelcea , Seth Copen Goldstein, Spatial computation, Proceedings of the 11th international conference on Architectural support for programming languages and operating systems, October 07-13, 2004, Boston, MA, USA
|
| |
9
|
Cadence. 2007. Cadence website. http://www.cadence.com.
|
 |
10
|
|
| |
11
|
Culler, D. E. 1990. Managing parallelism and resources in scientific dataflow programs. Ph.D. thesis, Massachusetts Institute of Technology.
|
 |
12
|
David E. Culler , Anurag Sah , Klaus E. Schauser , Thorsten von Eicken , John Wawrzynek, Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.164-175, April 08-11, 1991, Santa Clara, California, United States
|
 |
13
|
|
| |
14
|
|
 |
15
|
|
 |
16
|
|
| |
17
|
Desikan, R., Burger, D. C., Keckler, S. W., and Austin, T. M. 2001. Sim-Alpha: A validated, execution-driven Alpha 21264 simulator. Tech. Rep. TR-01-23, University of Texas-Austin, Department of Computer Sciences.
|
| |
18
|
Ekman, M. and Stenström, P. 2003. Performance and power impact of issue width in chip-multiprocessor cores. In Proceedings of the International Conference on Parallel Processing.
|
| |
19
|
Feo, J. T., Miller, P. J., and Skedzielewski, S. K. 1995. SISAL90. In Proceedings of the Conference on High Performance Functional Computing.
|
 |
20
|
|
 |
21
|
James R. Goodman , Mary K. Vernon , Philip J. Woest, Efficient synchronization primitives for large-scale cache-coherent multiprocessors, Proceedings of the third international conference on Architectural support for programming languages and operating systems, p.64-75, April 03-06, 1989, Boston, Massachusetts, United States
|
 |
22
|
|
 |
23
|
|
| |
24
|
Lance Hammond , Benedict A. Hubbert , Michael Siu , Manohar K. Prabhu , Michael Chen , Kunle Olukotun, The Stanford Hydra CMP, IEEE Micro, v.20 n.2, p.71-84, March 2000
[doi> 10.1109/40.848474]
|
 |
25
|
M. S. Hrishikesh , Doug Burger , Norman P. Jouppi , Stephen W. Keckler , Keith I. Farkas , Premkishore Shivakumar, The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays, Proceedings of the 29th annual international symposium on Computer architecture, May 25-29, 2002, Anchorage, Alaska
|
| |
26
|
Jain, A. E. A. 2001. A 1.2GHz Alpha microprocessor with 44.8GB/s chip pin bandwidth. In Proceedings of the IEEE International Solid-State Circuits Conference. Vol. 1. 240--241.
|
 |
27
|
Stephen W. Keckler , William J. Dally , Daniel Maskit , Nicholas P. Carter , Andrew Chang , Whay S. Lee, Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor, Proceedings of the 25th annual international symposium on Computer architecture, p.306-317, June 27-July 02, 1998, Barcelona, Spain
|
 |
28
|
|
| |
29
|
Krewel, K. 2005. Alpha EV7 processor: A high-performance tradition continues. Microprocessor Rep.
|
| |
30
|
Chunho Lee , Miodrag Potkonjak , William H. Mangione-Smith, MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.330-335, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
 |
31
|
Walter Lee , Rajeev Barua , Matthew Frank , Devabhaktuni Srikrishna , Jonathan Babb , Vivek Sarkar , Saman Amarasinghe, Space-time scheduling of instruction-level parallelism on a raw machine, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.46-57, October 02-07, 1998, San Jose, California, United States
|
 |
32
|
Jack L. Lo , Joel S. Emer , Henry M. Levy , Rebecca L. Stamm , Dean M. Tullsen , S. J. Eggers, Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading, ACM Transactions on Computer Systems (TOCS), v.15 n.3, p.322-354, Aug. 1997
[doi> 10.1145/263326.263382]
|
 |
33
|
Scott A. Mahlke , David C. Lin , William Y. Chen , Richard E. Hank , Roger A. Bringmann, Effective compiler support for predicated execution using the hyperblock, Proceedings of the 25th annual international symposium on Microarchitecture, p.45-54, December 01-04, 1992, Portland, Oregon, United States
|
 |
34
|
Ken Mai , Tim Paaske , Nuwan Jayasena , Ron Ho , William J. Dally , Mark Horowitz, Smart Memories: a modular reconfigurable architecture, Proceedings of the 27th annual international symposium on Computer architecture, p.161-171, June 2000, Vancouver, British Columbia, Canada
|
 |
35
|
Martha Mercaldi , Steven Swanson , Andrew Petersen , Andrew Putnam , Andrew Schwerin , Mark Oskin , Susan J. Eggers, Instruction scheduling for a tiled dataflow architecture, Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, October 21-25, 2006, San Jose, California, USA
|
 |
36
|
Martha Mercaldi , Steven Swanson , Andrew Petersen , Andrew Putnam , Andrew Schwerin , Mark Oskin , Susan J. Eggers, Modeling instruction placement on a spatial architecture, Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures, July 30-August 02, 2006, Cambridge, Massachusetts, USA
[doi> 10.1145/1148109.1148137]
|
| |
37
|
|
| |
38
|
|
| |
39
|
|
| |
40
|
Nikhil, R. 1990. The parallel programming language id and its compilation for parallel machines. In Proceedings of the Workshop on Massive Paralleism: Hardware, Programming and Applications. Acamedic Press.
|
 |
41
|
|
 |
42
|
|
 |
43
|
Andrew Petersen , Andrew Putnam , Martha Mercaldi , Andrew Schwerin , Susan Eggers , Steve Swanson , Mark Oskin, Reducing control overhead in dataflow architectures, Proceedings of the 15th international conference on Parallel architectures and compilation techniques, September 16-20, 2006, Seattle, Washington, USA
[doi> 10.1145/1152154.1152184]
|
 |
44
|
Karthikeyan Sankaralingam , Ramadass Nagarajan , Haiming Liu , Changkyu Kim , Jaehyuk Huh , Doug Burger , Stephen W. Keckler , Charles R. Moore, Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture, Proceedings of the 30th annual international symposium on Computer architecture, June 09-11, 2003, San Diego, California
|
| |
45
|
Shimada, T., Hiraki, K., and Nishida, K. 1984. An architecture of a data flow machine and its evaluation. ACM SIGARCH Comput. Architecure News 14, 2, 226--234.
|
 |
46
|
T. Shimada , K. Hiraki , K. Nishida , S. Sekiguchi, Evaluation of a prototype data flow processor of the SIGMA-1 for scientific computations, Proceedings of the 13th annual international symposium on Computer architecture, p.226-234, June 02-05, 1986, Tokyo, Japan
|
| |
47
|
Aaron Smith , Jon Gibson , Bertrand Maher , Nick Nethercote , Bill Yoder , Doug Burger , Kathryn S. McKinle , Jim Burrill, Compiling for EDGE Architectures, Proceedings of the International Symposium on Code Generation and Optimization, p.185-195, March 26-29, 2006
[doi> 10.1109/CGO.2006.10]
|
| |
48
|
SPEC. 2000. SPEC CPU 2000 benchmark specifications. SPEC2000 Benchmark Release.
|
| |
49
|
|
| |
50
|
|
 |
51
|
Steven Swanson , Andrew Putnam , Martha Mercaldi , Ken Michelson , Andrew Petersen , Andrew Schwerin , Mark Oskin , Susan J. Eggers, Area-Performance Trade-offs in Tiled Dataflow Architectures, Proceedings of the 33rd annual international symposium on Computer Architecture, p.314-326, June 17-21, 2006
|
 |
52
|
Michael Bedford Taylor , Walter Lee , Jason Miller , David Wentzlaff , Ian Bratt , Ben Greenwald , Henry Hoffmann , Paul Johnson , Jason Kim , James Psota , Arvind Saraf , Nathan Shnidman , Volker Strumpen , Matt Frank , Saman Amarasinghe , Anant Agarwal, Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams, Proceedings of the 31st annual international symposium on Computer architecture, p.2, June 19-23, 2004, München, Germany
|
| |
53
|
TSMC. 2007. Silicon design chain cooperation enables nanometer chip design. Cadence Whitepaper. http://www.cadence.com/whitepapers/.
|
| |
54
|
|
CITED BY 2
|
|
Steven Swanson , Andrew Putnam , Martha Mercaldi , Ken Michelson , Andrew Petersen , Andrew Schwerin , Mark Oskin , Susan J. Eggers, Area-Performance Trade-offs in Tiled Dataflow Architectures, ACM SIGARCH Computer Architecture News, v.34 n.2, p.314-326, May 2006
|
|
|
|
REVIEW
"Joseph M. Arul : Reviewer"
The WaveScalar architecture aims to take advantages found in dataflow to achieve more parallelism with less silicon area, along with better performance and scalability than superscalar architecture. The program counter, which is a bottleneck in vo
more...
|