ABSTRACT
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches, implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
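The quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the decomposition is taken to come from three simulated run times (a perfect memory system, a memory system with real latency but unlimited bandwidth, and the full system), the traffic ratio is taken to be traffic below a cache divided by traffic above it, and all function names, parameters, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionBreakdown:
    """Fractions of total run time; the three fields sum to 1."""
    processing: float  # time the processor would need with a perfect memory system
    latency: float     # extra time caused by raw memory latency
    bandwidth: float   # extra time caused by limited bandwidth


def decompose_execution_time(t_perfect, t_infinite_bw, t_full):
    """Split run time into processing, latency-stall, and bandwidth-stall fractions.

    Assumed inputs (hypothetical, not taken from the abstract):
      t_perfect     -- cycles with every memory access satisfied immediately
      t_infinite_bw -- cycles with real latencies but unlimited bandwidth
      t_full        -- cycles on the fully modeled system
    """
    if not (0 < t_perfect <= t_infinite_bw <= t_full):
        raise ValueError("expected 0 < t_perfect <= t_infinite_bw <= t_full")
    return ExecutionBreakdown(
        processing=t_perfect / t_full,
        latency=(t_infinite_bw - t_perfect) / t_full,
        bandwidth=(t_full - t_infinite_bw) / t_full,
    )


def traffic_ratio(bytes_below, bytes_above):
    """Assumed definition: traffic leaving a cache divided by traffic entering it."""
    if bytes_above <= 0:
        raise ValueError("bytes_above must be positive")
    return bytes_below / bytes_above


def effective_pin_bandwidth(raw_pin_bandwidth, ratio):
    """Assumed definition: raw pin bandwidth scaled up by how well the
    on-chip cache filters traffic (a smaller ratio means better filtering)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_pin_bandwidth / ratio


def traffic_gap(cache_traffic, minimal_traffic):
    """How many times more traffic a real cache generates than a
    minimal-traffic cache for the same workload."""
    if minimal_traffic <= 0:
        raise ValueError("minimal_traffic must be positive")
    return cache_traffic / minimal_traffic


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the functions.
    breakdown = decompose_execution_time(t_perfect=50, t_infinite_bw=70, t_full=100)
    print(breakdown)
    ratio = traffic_ratio(bytes_below=25, bytes_above=100)
    print(effective_pin_bandwidth(raw_pin_bandwidth=800, ratio=ratio))
    print(traffic_gap(cache_traffic=1000, minimal_traffic=8))
```

Under these assumptions, a run whose bandwidth fraction exceeds its latency fraction corresponds to the case the abstract describes, where cycles lost to insufficient bandwidth outweigh those lost to raw memory latency. A traffic gap above 100 would correspond to the two-orders-of-magnitude difference the abstract reports.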