ABSTRACT
The growing dominance of wire delays at future technology points renders a microprocessor communication-bound. Clustered microarchitectures allow most dependence chains to execute without being affected by long on-chip wire latencies. They also allow faster clock speeds and reduce design complexity, and have thereby emerged as a popular design choice for future microprocessors. However, a centralized data cache threatens to be the primary bottleneck in highly clustered systems. The paper attempts to identify the most complexity-effective approach to alleviating this bottleneck. While decentralized cache organizations have been proposed, they introduce excessive logic and wiring complexity. The paper evaluates whether the performance gains of a decentralized cache are worth the increase in complexity. We also introduce and evaluate Cluster Prefetch, the forwarding of data values to a cluster through accurate address prediction. Our results show that the success of this technique depends on accurate speculation across unresolved stores. The technique applies to a wide class of processor models and, most importantly, it allows high performance even with a simple centralized data cache. We conclude that address prediction holds more promise for future wire-delay-limited processors than decentralized cache organizations.