Improving Latency Tolerance of Multithreading through Decoupling
Source: IEEE Transactions on Computers
Volume 50, Issue 10 (October 2001)
Pages: 1084-1094
Year of Publication: 2001
ISSN: 0018-9340
Authors
Joan-Manuel Parcerisa  Univ. Politècnica de Catalunya, Barcelona, Spain
Antonio Gonzalez  Univ. Politècnica de Catalunya, Barcelona, Spain
Publisher
IEEE Computer Society  Washington, DC, USA
DOI Bookmark: 10.1109/12.956093

ABSTRACT

The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization and its ability to make efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly affected by this problem. This work presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in-order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling hides memory latency very efficiently, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. Our study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. Since one of the problems of multithreading is the degradation of the memory system performance, both in terms of miss latency and bandwidth requirements, this improvement becomes critical for high miss latencies, where bandwidth might become a bottleneck. Finally, although it may seem rather surprising, our study reveals that multithreading by itself exhibits little memory latency tolerance. Our results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.
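
The latency-hiding effect of access/execute decoupling described in the abstract can be illustrated with a deliberately simplified, single-threaded toy model. This is only a sketch under stated assumptions and is not the authors' simulator: every iteration is one load followed by one dependent compute operation, every load takes the full miss latency, each unit issues at most one instruction per cycle in order, and the function names and the `queue_depth` parameter are hypothetical. It does not model multithreading and is not expected to reproduce the percentages reported above.

```python
def inorder_cycles(n_iters, miss_latency):
    """Cycles for a single in-order pipeline (toy model, no decoupling).

    The dependent compute op waits for its load, and because issue is
    in-order the next load cannot start until that compute has issued.
    """
    cycle = 0
    for _ in range(n_iters):
        load_issue = cycle
        data_ready = load_issue + miss_latency
        compute_issue = max(load_issue + 1, data_ready)
        cycle = compute_issue + 1
    return cycle


def decoupled_cycles(n_iters, miss_latency, queue_depth):
    """Cycles for a toy access/execute decoupled pair of in-order units.

    The access unit runs ahead issuing loads into a queue with
    `queue_depth` slots (a hypothetical parameter); the execute unit
    consumes queue entries in order as their data arrives.
    """
    consumed_at = []  # cycle at which the execute unit consumed entry i
    access_cycle = 0  # next cycle the access unit can issue
    execute_cycle = 0  # next cycle the execute unit can issue
    for i in range(n_iters):
        if i >= queue_depth:
            # A slot frees up only after an older entry has been consumed.
            access_cycle = max(access_cycle, consumed_at[i - queue_depth] + 1)
        load_issue = access_cycle
        access_cycle = load_issue + 1
        data_ready = load_issue + miss_latency
        compute_issue = max(execute_cycle, data_ready)
        consumed_at.append(compute_issue)
        execute_cycle = compute_issue + 1
    return execute_cycle


if __name__ == "__main__":
    n = 1000
    for latency in (1, 32):
        print(latency, inorder_cycles(n, latency), decoupled_cycles(n, latency, 64))
```

In this toy model, once the queue is at least as deep as the miss latency, the decoupled pair pays the latency only once at start-up, whereas the plain in-order pipeline pays it on every iteration. With a shallow queue the access unit cannot run far enough ahead and the benefit shrinks.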


