ABSTRACT
The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization and prevent it from making efficient use of future increases in transistor budget. SMT processors, built on a superscalar core, are therefore directly affected by this problem. This work presents and evaluates a novel processor microarchitecture that combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is better suited to future growth in issue width and clock speed. We investigate how the two techniques complement each other. Since decoupling hides memory latency very efficiently, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. Our study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious reduction in hardware complexity, it places lower demands on the memory system. Since one of the problems of multithreading is the degradation of memory system performance, both in terms of miss latency and bandwidth requirements, this improvement becomes critical for high miss latencies, where bandwidth might become a bottleneck. Finally, although it may seem rather surprising, our study reveals that multithreading by itself exhibits little memory latency tolerance. Our results suggest that most of the latency hiding effectiveness of SMT architectures comes from dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.