| Compiler managed micro-cache bypassing for high performance EPIC processors |
| Source |
International Symposium on Microarchitecture
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Istanbul, Turkey
Session: Compiler scheduling
Pages: 134-145
Year of Publication: 2002
ISSN: 1072-4451, ISBN: 0-7695-1859-1
Authors
Youfeng Wu, Intel Corporation, Santa Clara, CA
Ryan Rakvic, Intel Corporation, Santa Clara, CA
Li-Ling Chen, Intel Corporation, Santa Clara, CA
Chyi-Chang Miao, Intel Corporation, Shrewsbury, MA
George Chrysos, Intel Corporation, Shrewsbury, MA
Jesse Fang, Intel Corporation, Santa Clara, CA
| Publisher |
IEEE Computer Society Press
Los Alamitos, CA, USA
ABSTRACT
Advanced microprocessors have pushed clock rates well beyond the gigahertz boundary. For such high-performance microprocessors, a small, fast data micro-cache (ucache) is important to overall performance, and managing it properly through load bypassing has a significant performance impact. In this paper, we propose and evaluate a collaborative hardware-software technique for managing ucache bypassing on EPIC processors. The hardware supports ucache bypassing with a flag in the load instruction format, and the compiler uses static analysis and profiling to identify loads that should bypass the ucache. The collaborative method significantly improves performance on the SPECint2000 benchmarks. On average, about 40%, 30%, 24%, and 22% of load references are identified to bypass ucaches of 256B, 1KB, 4KB, and 8KB, respectively. Bypassing reduces the ucache miss rates by 39%, 32%, 28%, and 26%, and reduces the number of pipeline stalls from loads to their uses by 13%, 9%, 6%, and 5%. Meanwhile, L1 and L2 cache misses remain largely unchanged. For the 256B ucache, bypassing improves overall performance by 5% on average.
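The division of labor described in the abstract (a per-load bypass flag honored by the hardware, set by a compiler that consults profile data) can be illustrated with a toy model. The sketch below is not the authors' algorithm: the abstract does not give their heuristics, so the direct-mapped organization, the 32-byte line size, the 0.9 miss-rate threshold, and the names `MicroCache`, `profile_bypass_set`, and `run` are all assumptions made for illustration. It also ignores the L1 and L2 caches that would service bypassing loads.

```python
from collections import defaultdict


class MicroCache:
    """Toy direct-mapped ucache (organization and line size are assumed)."""

    def __init__(self, size_bytes=256, line_bytes=32):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags = [None] * self.num_lines
        self.hits = 0
        self.misses = 0

    def load(self, addr, bypass=False):
        if bypass:
            return  # a flagged load neither looks up nor allocates a line
        line = addr // self.line_bytes
        idx = line % self.num_lines
        if self.tags[idx] == line:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[idx] = line


def profile_bypass_set(trace, threshold=0.9):
    """Profiling pass: flag load PCs whose ucache miss rate >= threshold."""
    cache = MicroCache()
    stats = defaultdict(lambda: [0, 0])  # pc -> [misses, executions]
    for pc, addr in trace:
        before = cache.misses
        cache.load(addr)
        stats[pc][0] += cache.misses - before
        stats[pc][1] += 1
    return {pc for pc, (m, n) in stats.items() if m / n >= threshold}


def run(trace, bypass_pcs=frozenset()):
    """Replay the trace, setting the bypass flag for the chosen load PCs."""
    cache = MicroCache()
    for pc, addr in trace:
        cache.load(addr, bypass=pc in bypass_pcs)
    return cache


# Hypothetical trace: load 0x10 reuses four hot lines, load 0x20 streams.
trace = []
for i in range(64):
    trace.append((0x10, (i % 4) * 32))
    trace.append((0x20, 4096 + i * 32))

bypass_pcs = profile_bypass_set(trace)
baseline = run(trace)
tuned = run(trace, bypass_pcs)
print(sorted(hex(pc) for pc in bypass_pcs), baseline.misses, tuned.misses)
```

In this model only the streaming load is flagged. Once it stops allocating lines, it no longer evicts the hot data, so the reused load misses only on its first touch of each line.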
CITED BY 4
Lingxiang Xiang, Tianzhou Chen, Qingsong Shi, Wei Hu, Less reused filter: improving L2 cache performance via filtering less reused lines, Proceedings of the 23rd international conference on Supercomputing, June 08-12, 2009, Yorktown Heights, NY, USA