ABSTRACT
Active memory systems help processors overcome the memory wall when applications exhibit poor cache behavior. They consist of either active memory elements that perform data-parallel computations in the memory system itself, or an active memory controller that supports address re-mapping techniques that improve data locality. Both active memory approaches create coherence problems, even on uniprocessor systems, because either additional processors operate on the data directly or the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we propose a new technique that solves the coherence problem by extending the coherence protocol. Our active memory controller leverages and extends the coherence mechanism so that re-mapping techniques work transparently on both uniprocessor and multiprocessor systems. We present a microarchitecture for an active memory controller with a programmable core and specialized hardware that accelerates cache line assembly and disassembly. Detailed simulation results show uniprocessor speedups from 1.3 to 7.6 on a range of applications and microbenchmarks. In addition to uniprocessor speedup, we show single-node multiprocessor speedup for parallel active memory applications and discuss how the same controller architecture supports coherent multi-node systems called active memory clusters.