ABSTRACT
Providing adequate data bandwidth is essential for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can significantly increase hardware complexity. This paper takes an alternative or complementary approach to providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to a program's local variables, which are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all memory references, while their reference space is very small, averaging around 7 words per (static) procedure. To service local variable accesses quickly, two optimizations, fast data forwarding and access combining, are proposed and studied. Important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied through simulation. The potential performance of the proposed scheme is measured under various configurations, and it is concluded that the scheme can become a viable alternative to building a single multi-ported data cache.
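As a rough illustration of the partitioning idea, the sketch below splits a stream of memory accesses into a local-variable queue and a general queue, then groups neighbouring local accesses that fall in the same cache line. It is a toy model under stated assumptions, not the mechanism evaluated in the paper: the abstract does not say how accesses are classified, so the rule "the base register is the stack pointer" is an assumption, and the names `MemAccess`, `partition`, and `combine`, the 32-byte line size, and the grouping rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: an access counts as "local" when it is addressed off the
# stack pointer. The abstract only says that the compiler and/or hardware
# supplies this classification.
STACK_BASE_REG = "sp"


@dataclass(frozen=True)
class MemAccess:
    op: str        # "load" or "store"
    base_reg: str  # register used as the address base
    offset: int    # byte offset from the base register


def partition(accesses: List[MemAccess]) -> Tuple[List[MemAccess], List[MemAccess]]:
    """Split one memory stream into (local_queue, general_queue), keeping order."""
    local_queue: List[MemAccess] = []
    general_queue: List[MemAccess] = []
    for access in accesses:
        if access.base_reg == STACK_BASE_REG:
            local_queue.append(access)
        else:
            general_queue.append(access)
    return local_queue, general_queue


def combine(queue: List[MemAccess], line_bytes: int = 32, degree: int = 2) -> List[List[MemAccess]]:
    """Group consecutive same-kind accesses to one cache line, up to `degree` per group.

    Hypothetical reading of "access combining": each group stands for a
    single cache access that serves several queued requests.
    """
    groups: List[List[MemAccess]] = []
    for access in queue:
        line = access.offset // line_bytes
        if groups:
            last = groups[-1]
            same_line = last[0].offset // line_bytes == line
            if len(last) < degree and last[0].op == access.op and same_line:
                last.append(access)
                continue
        groups.append([access])
    return groups


stream = [
    MemAccess("load", "sp", 0),
    MemAccess("load", "sp", 8),
    MemAccess("load", "r4", 0),
    MemAccess("store", "sp", 16),
    MemAccess("load", "sp", 40),
]
local_queue, general_queue = partition(stream)
combined = combine(local_queue, line_bytes=32, degree=2)
```

In this toy stream, the two stack loads at offsets 0 and 8 share a line and form one group, so the four local accesses need three cache accesses instead of four. Raising `degree` only helps when more same-kind accesses to one line sit next to each other in the queue.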