ABSTRACT
There are two competing models for the on-chip memory in Chip Multiprocessor (CMP) systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications on systems with up to 16 cores, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and nonallocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.