ABSTRACT
The high clock frequencies of modern superscalar processors make the wire delay incurred in moving data across the processor chip a significant concern. As frequencies continue to increase, it will become more difficult for a centralized first level data cache to supply the timely data bandwidth required by superscalar processors. This paper presents a complete solution for the partitioning of the first level of the memory hierarchy. The first level data cache is split into several independent partitions, which are arbitrarily distributable across the processor die. After being decoded, memory instructions are sent to the reservation stations of the functional unit adjacent to the cache partition that they are most likely to access. The partition assignments for both static instructions and cache data are dynamically changed to adapt to data access patterns. A data cache line is permitted to reside in only one partition at a time, allowing each store to update only a single partition, and allowing the partitioning and simplification of the memory disambiguation logic. The partitioned cache achieves a reduction in cache access latency through a combination of reduced wire delay and reduced cache array size. A partitioned cache with eight 8KB direct-mapped partitions maintains a hit rate greater than that of a 32KB direct-mapped cache. A machine utilizing the partitioned cache outperforms a machine with a conventional 64KB direct-mapped cache by 4.5% and a machine with a 64KB 8-way set-associative cache by 7.0%, when cache latencies estimated through the use of the CACTI cache simulation tool are taken into account.
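The steering idea described in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's mechanism: the abstract does not say how the likely partition is predicted, so the per-instruction table below (remember the partition that last held the data) is hypothetical, as are the class and method names and the 32-byte line size. Only the eight 8KB direct-mapped partitions and the one-partition-per-line rule come from the abstract. Dynamic migration of cache lines between partitions is left out.

```python
class PartitionedCache:
    """Toy model: direct-mapped partitions; a line lives in at most one partition."""

    def __init__(self, num_partitions=8, lines_per_partition=256, line_bytes=32):
        # 8 partitions x 256 lines x 32 bytes = eight 8KB partitions.
        # The 32-byte line size is an assumption, not taken from the abstract.
        self.num_partitions = num_partitions
        self.lines_per_partition = lines_per_partition
        self.line_bytes = line_bytes
        # One tag store per partition: set index -> resident line address.
        self.tags = [dict() for _ in range(num_partitions)]
        # Hypothetical steering table: instruction PC -> predicted partition.
        self.predicted = {}

    def _line_and_set(self, addr):
        line = addr // self.line_bytes
        return line, line % self.lines_per_partition

    def find(self, addr):
        """Return the partition currently holding the line, or None."""
        line, index = self._line_and_set(addr)
        for p in range(self.num_partitions):
            if self.tags[p].get(index) == line:
                return p
        return None

    def steer(self, pc):
        """Choose the partition whose functional unit receives this instruction."""
        # Untrained instructions fall back to a simple hash of the PC (assumption).
        return self.predicted.get(pc, pc % self.num_partitions)

    def access(self, pc, addr):
        """Simulate one memory instruction; return 'local', 'remote', or 'miss'."""
        line, index = self._line_and_set(addr)
        target = self.steer(pc)
        owner = self.find(addr)
        if owner is None:
            # Miss: fill the steered partition only, replacing any conflicting line,
            # so the line never exists in two partitions at once.
            self.tags[target][index] = line
            owner = target
            outcome = "miss"
        elif owner == target:
            outcome = "local"
        else:
            outcome = "remote"
        # Retrain the static instruction toward the partition that held its data.
        self.predicted[pc] = owner
        return outcome
```

In this sketch, an instruction that first reaches its data remotely is steered to the owning partition on its next execution, which is the adaptive behavior the abstract attributes to instruction partition assignments. Because a fill writes to exactly one partition, a store would likewise need to update only one partition.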