ABSTRACT
The performance of modern microprocessors is increasingly limited by their inability to hide main memory latency. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose the use of Active Memory Operations (AMOs), in which select operations can be sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper we present architectural and programming models for AMOs, and compare their performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster database queries. Based on a standard cell implementation, we predict that the circuitry required to support AMOs is less than 1% of the typical chip area of a high-performance microprocessor.