Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors

Source
International Symposium on Computer Architecture
Proceedings of the 25th annual international symposium on Computer architecture
Barcelona, Spain
Pages: 342-355
Year of Publication: 1998
ISBN: 0-8186-8491-7

Authors
Vijayaraghavan Soundararajan (Computer Systems Lab, Stanford University, Stanford, CA)
Mark Heinrich (Computer Systems Lab, Stanford University, Stanford, CA)
Ben Verghese (Digital Equipment Corporation, Western Research Lab, Palo Alto, CA)
Kourosh Gharachorloo (Digital Equipment Corporation, Western Research Lab, Palo Alto, CA)
Anoop Gupta (Computer Systems Lab, Stanford University, Stanford, CA, and Microsoft Corporation, Redmond, WA)
John Hennessy (Computer Systems Lab, Stanford University, Stanford, CA)

Publisher
IEEE Computer Society, Washington, DC, USA
ABSTRACT
Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance.

In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache coherence and data locality in compute-server workloads. First, we consider two protocols for providing base cache coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that a centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. All of these schemes run on the same base platform, with workloads that include both user and kernel references. The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible.
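The core argument of the abstract is arithmetic: if a remote miss costs considerably more than a local one, every miss moved into local memory shortens memory stall time, and replication (giving several readers their own copy) and migration (moving a page to its dominant user) are two ways to move it. The short Python sketch below illustrates both points under stated assumptions. The latency values, the miss-count threshold, and the names `avg_miss_latency`, `memory_stall_ns`, and `choose_page_action` are hypothetical, and the decision rule is a generic toy policy, not the hardware, software, or hybrid schemes the paper evaluates on FLASH.

```python
"""Toy model of why data locality matters on a CC-NUMA machine.

ASSUMPTION: all latencies, thresholds, and function names are hypothetical
illustrations, not measurements or policies taken from FLASH or the paper.
"""

# Hypothetical cache-miss latencies in nanoseconds (illustrative only).
LOCAL_NS = 150.0
REMOTE_NS = 500.0


def avg_miss_latency(remote_fraction: float,
                     local_ns: float = LOCAL_NS,
                     remote_ns: float = REMOTE_NS) -> float:
    """Average cost of one cache miss when `remote_fraction` of all misses
    must be serviced by another node's memory."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must lie in [0, 1]")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def memory_stall_ns(misses: int, remote_fraction: float) -> float:
    """Total modeled memory stall time for `misses` cache misses."""
    return misses * avg_miss_latency(remote_fraction)


def choose_page_action(miss_counts: dict[int, int], home: int,
                       written: bool, threshold: int = 64) -> str:
    """Hypothetical per-page rule that turns remote misses into local ones.

    miss_counts maps node id -> misses to this page in the last interval;
    home is the node whose memory currently holds the page.
    Returns "migrate", "replicate", or "none".
    """
    # Nodes that missed on this page at least `threshold` times.
    hot = {node for node, count in miss_counts.items() if count >= threshold}
    if not hot - {home}:
        return "none"        # no remote node misses heavily on this page
    if len(hot) == 1:
        return "migrate"     # a single remote node dominates: move the page there
    if not written:
        return "replicate"   # several hot readers: give each a local copy
    return "none"            # written and shared: replicas would need coherence


if __name__ == "__main__":
    # Halve the remote fraction of one million misses (0.6 -> 0.3).
    before = memory_stall_ns(1_000_000, 0.6)
    after = memory_stall_ns(1_000_000, 0.3)
    print(f"stall before: {before / 1e6:.1f} ms, after: {after / 1e6:.1f} ms")
    print(choose_page_action({0: 2, 3: 900}, home=0, written=True))
    print(choose_page_action({0: 120, 1: 300, 2: 250}, home=0, written=False))
```

With these made-up numbers, cutting the remote fraction of a million misses from 0.6 to 0.3 lowers the modeled stall from about 360 ms to about 255 ms. The toy policy only replicates pages that are not written, on the assumption that keeping writable replicas coherent would cost more than the locality it buys.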