ABSTRACT
Level-one caches normally reside on a processor's critical path, which determines clock frequency, so fast access to the level-one cache is important. Direct-mapped caches have faster access times but poorer hit rates than same-sized set-associative caches because accesses are distributed nonuniformly across the cache sets: some sets incur many misses while others are underutilized. We propose to increase the decoder length and, hence, reduce the accesses to heavily used sets without dynamically detecting cache set usage information. We increase the accesses to underutilized cache sets by incorporating a replacement policy into the cache design using programmable decoders. On average, the proposed techniques achieve a miss rate as low as that of a traditional 4-way cache across all 26 SPEC2K benchmarks, for both the instruction and data caches. This translates into average IPC improvements of 21.5% and 42.4% for the SPEC2K integer and floating-point benchmarks, respectively. The B-Cache consumes 10.5% more power per access but achieves a 12% saving in total memory access-related energy, owing to the lower miss rate and the resulting reduction in application execution time. Compared with previous techniques that aim to reduce the miss rate of direct-mapped caches, our technique serves every cache hit in a single cycle and has the same access time as a direct-mapped cache.
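The conflict-miss behavior that motivates this work can be illustrated with a minimal Python sketch. It is not the B-Cache design and does not model programmable decoders. It only compares a plain direct-mapped cache with a 4-way LRU cache of the same capacity. The function name `miss_rate`, the 16 KB capacity, the 32-byte line size, and the synthetic two-address trace are all assumptions chosen for illustration.

```python
from collections import OrderedDict

def miss_rate(addresses, num_lines, ways, line_size=32):
    """Simulate a cache with LRU replacement; ways=1 is direct-mapped.

    Illustrative model only: it counts misses and ignores timing and energy.
    """
    num_sets = num_lines // ways
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_size
        index = block % num_sets   # set selected by the index bits
        tag = block // num_sets    # remaining high-order bits
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses / len(addresses)

# Hypothetical trace: two blocks one cache-size apart, accessed alternately.
LINE_SIZE = 32                 # bytes per line (assumed)
NUM_LINES = 512                # 512 lines * 32 B = 16 KB (assumed)
stride = LINE_SIZE * NUM_LINES
trace = [0, stride] * 1000

direct_mapped = miss_rate(trace, NUM_LINES, ways=1, line_size=LINE_SIZE)
four_way = miss_rate(trace, NUM_LINES, ways=4, line_size=LINE_SIZE)
print(f"direct-mapped miss rate: {direct_mapped:.3f}")
print(f"4-way LRU miss rate:     {four_way:.3f}")
```

In this contrived trace the two blocks evict each other on every access in the direct-mapped cache, while the 4-way cache holds both after the first two compulsory misses. Real workloads fall between these extremes.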