ABSTRACT
In response to current technology scaling trends, architects are developing a new style of processor known as the spatial computer. A spatial computer is composed of hundreds or even thousands of simple, replicated processing elements (PEs), frequently organized into a grid. Several current spatial computers, such as TRIPS, RAW, SmartMemories, nanoFabrics, and WaveScalar, explicitly place a program's instructions onto the grid. Designing instruction placement algorithms is an enormous challenge: the number of possible mappings of instructions to PEs grows exponentially with the size of the application, and the choice of mapping greatly affects program performance. In this paper we develop a performance model that can inform instruction placement. The model comprises three components, each of which captures a different aspect of spatial computing performance: inter-instruction operand latency, data cache coherence overhead, and contention for processing element resources. We evaluate the model on one spatial computer, WaveScalar, and find that predicted and actual performance correlate with a coefficient of -0.90. We demonstrate the model's utility by using it to design a new placement algorithm, which outperforms our previous algorithms. Although developed in the context of WaveScalar, the model can serve as a foundation for tuning code, compiling software, and understanding the microarchitectural trade-offs of spatial computers in general.
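To make the idea of scoring a placement concrete, here is a minimal toy sketch in Python. It is not the paper's model: the cost terms, the weights, the per-PE capacity, and the example dataflow graph are all assumptions chosen for illustration. It covers only two of the three components named above (operand latency, approximated by Manhattan distance on the grid, and PE contention, approximated by instructions beyond an assumed capacity) and leaves out data cache coherence overhead entirely.

```python
from collections import Counter

def operand_latency(edges, placement):
    """Sum of Manhattan distances between producer and consumer PEs.

    Assumption: operand latency grows linearly with grid distance.
    """
    total = 0
    for src, dst in edges:
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        total += abs(x1 - x2) + abs(y1 - y2)
    return total

def contention(placement, capacity):
    """Count instructions that exceed an assumed per-PE capacity."""
    load = Counter(placement.values())
    return sum(max(0, n - capacity) for n in load.values())

def placement_cost(edges, placement, capacity=2,
                   w_latency=1.0, w_contention=1.0):
    """Hypothetical weighted cost; lower is assumed to be better."""
    return (w_latency * operand_latency(edges, placement)
            + w_contention * contention(placement, capacity))

def mapping_count(num_instructions, num_pes):
    """Number of ways to assign each instruction to one PE."""
    return num_pes ** num_instructions

# Hypothetical 4-instruction dataflow graph: a -> c, b -> c, c -> d
edges = [("a", "c"), ("b", "c"), ("c", "d")]

# Two candidate placements on a 2x2 grid of PEs.
packed = {name: (0, 0) for name in "abcd"}
spread = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 1)}

if __name__ == "__main__":
    print("packed cost:", placement_cost(edges, packed))
    print("spread cost:", placement_cost(edges, spread))
    print("possible mappings:", mapping_count(4, 4))
```

Even in this toy setting the two terms pull in opposite directions: packing instructions onto one PE removes operand latency but adds contention, while spreading them does the reverse. The mapping count also grows as the number of PEs raised to the number of instructions, which is why exhaustive search is impractical for real applications.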