ABSTRACT
Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations (one based on a cache-outcome condition code and another based on low-overhead traps) and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.
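The core idea in the abstract, a memory reference that transfers control to a handler only when it misses in the cache, can be illustrated with a small software model. The sketch below is not the hardware mechanism the paper proposes: the toy direct-mapped cache and the names `DirectMappedCache`, `informing_load`, and `run_demo` are all hypothetical, chosen only to show the control flow of "load, then call the handler on a miss".

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only tags, not data (illustrative assumption)."""

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = addr // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False


def informing_load(cache, memory, addr, miss_handler):
    """Load memory[addr]; call miss_handler(addr) only if the access misses.

    The handler call stands in for the conditional branch-and-link that is
    taken only on a cache miss. On a hit, no handler code runs at all.
    """
    if not cache.access(addr):
        miss_handler(addr)
    return memory[addr]


def run_demo():
    """Issue a short reference stream and record which addresses missed."""
    memory = list(range(256))
    cache = DirectMappedCache(num_lines=4, line_size=16)
    missed = []
    addresses = (0, 4, 64, 0, 68)
    values = [informing_load(cache, memory, a, missed.append) for a in addresses]
    return values, missed


if __name__ == "__main__":
    values, missed = run_demo()
    print("loaded values:", values)
    print("addresses that invoked the miss handler:", missed)
```

In this model the handler simply records the missing address, which is roughly what a profiling tool might do. A handler could equally issue prefetches or run an access-control check for a software cache coherence protocol, the case study named in the abstract.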