ABSTRACT
This paper describes the architecture and implementation of the AlphaServer GS320, a cache-coherent non-uniform memory access multiprocessor developed at Compaq. The AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing with 32 to 64 processors. Each node in the design consists of four Alpha 21264 processors, up to 32GB of coherent memory, and an aggressive IO subsystem. The current implementation supports up to 8 such nodes for a total of 32 processors. While snoopy-based designs have been stretched to medium-scale multiprocessors by some vendors, providing sufficient snoop bandwidth remains a major challenge, especially in systems with aggressive processors. At the same time, directory protocols targeted at larger-scale designs lead to a number of inherent inefficiencies relative to snoopy designs. A key goal of the AlphaServer GS320 architecture has been to achieve the best of both worlds, partly by exploiting the bounded scale of the target systems.

This paper focuses on the unique design features used in the AlphaServer GS320 to efficiently implement coherence and consistency. The guiding principle for our directory-based protocol is to address correctness issues related to rare protocol races without burdening the common transaction flows. Our protocol exhibits lower occupancy and lower message counts compared to previous designs, and provides more efficient handling of 3-hop transactions. Furthermore, our design naturally lends itself to elegant solutions for deadlock, livelock, starvation, and fairness. The AlphaServer GS320 architecture also incorporates a couple of innovative techniques that extend previous approaches for efficiently implementing memory consistency models. These techniques allow us to generate commit events (which are used for ordering purposes) well in advance of formulating the reply to a transaction. Furthermore, the separation of the commit event allows time-critical replies to bypass inbound requests without violating ordering properties. Even though our design specifically targets medium-scale servers, many of the same techniques can be applied to larger-scale directory-based and smaller-scale snoopy-based designs. Finally, we evaluate the performance impact of some of the above optimizations and present a few competitive benchmark results.