An L2-miss-driven early register deallocation for SMT processors
Source
International Conference on Supercomputing
Proceedings of the 21st Annual International Conference on Supercomputing
Seattle, Washington
SESSION: Architecture -- processor
Pages: 138 - 147  
Year of Publication: 2007
ISBN: 978-1-59593-768-1
Authors
Joseph Sharkey, State University of New York, Binghamton, NY
Dmitry Ponomarev, State University of New York, Binghamton, NY
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 34,   Citation Count: 2
DOI: http://doi.acm.org/10.1145/1274971.1274992

ABSTRACT

The register file is one of the most critical datapath components limiting the number of threads that a Simultaneous Multithreading (SMT) processor can support. Smaller register files can be used without degrading performance if registers are allocated and deallocated aggressively enough to maximize their utilization. In this paper, we propose a novel technique for the early deallocation of physical registers allocated to threads that experience L2 cache misses. The technique speculatively commits the load-independent instructions and deallocates the registers corresponding to the previous mappings of their destinations, without waiting for the cache miss request to be serviced. The early-deallocated registers are then immediately available for allocation to instructions in the same thread as well as in other threads, thus improving overall processor throughput. On average across the simulated mixes of multiprogrammed SPEC 2000 workloads, our technique yields a 33% improvement in throughput and a 25% improvement in the harmonic mean of weighted IPCs over a baseline SMT processor with the state-of-the-art DCRA policy. This is achieved without creating checkpoints, maintaining per-register counters of pending consumers, or performing tag re-broadcasts, register re-mappings, or additional associative searches.
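
To make the idea in the abstract concrete, the following is a toy software model, not the hardware mechanism described in the paper. It compares which "previous mapping" registers can be returned to the free list under conventional in-order commit and under a simplified early release behind a stalled load. The names RobEntry, conventional_release, and early_release are hypothetical. The check against registers still awaited by older unfinished instructions is a conservative assumption of this sketch only; the abstract states that the actual design needs no per-register consumer counters. Exceptions and branch mispredictions are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class RobEntry:
    """One in-flight instruction in program order (hypothetical, simplified)."""
    name: str
    prev_phys: int         # physical register previously mapped to the destination
    completed: bool        # has the instruction finished executing?
    miss_dependent: bool   # is it the missing load, or does it depend on it?
    src_phys: Set[int] = field(default_factory=set)  # physical source registers


def conventional_release(rob: List[RobEntry]) -> Set[int]:
    """In-order commit: nothing younger than a stalled instruction is released."""
    freed: Set[int] = set()
    for entry in rob:                  # oldest first
        if not entry.completed:
            break                      # the L2-missing load blocks the commit stage
        freed.add(entry.prev_phys)
    return freed


def early_release(rob: List[RobEntry]) -> Set[int]:
    """Toy early release: free previous mappings of completed, load-independent
    instructions, skipping registers that an older unfinished instruction
    still has to read (assumption of this sketch, not of the paper)."""
    freed: Set[int] = set()
    pending_reads: Set[int] = set()
    for entry in rob:
        if not entry.completed:
            pending_reads |= entry.src_phys
            continue
        if entry.miss_dependent:
            continue
        if entry.prev_phys not in pending_reads:
            freed.add(entry.prev_phys)
    return freed


# Example window: a load misses in L2, one instruction depends on it,
# and two younger instructions are independent and already finished.
rob = [
    RobEntry("load r1", prev_phys=10, completed=False, miss_dependent=True),
    RobEntry("add r2 = r1 + r3", prev_phys=11, completed=False,
             miss_dependent=True, src_phys={20, 12}),
    RobEntry("mul r4 = r5 * r6", prev_phys=13, completed=True,
             miss_dependent=False, src_phys={14, 15}),
    RobEntry("sub r3 = r5 - r6", prev_phys=12, completed=True,
             miss_dependent=False, src_phys={14, 15}),
]
```

In this example, conventional_release(rob) frees nothing because the load at the head is stalled, while early_release(rob) frees register 13. Register 12 is held back because the waiting add still reads it.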

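The two metrics quoted in the abstract can be sketched as follows. The abstract does not define them, so this assumes the definitions commonly used in SMT studies: throughput is the sum of per-thread IPCs, and each thread's weighted IPC is its IPC in the SMT mix divided by its IPC when running alone. The function names and the sample numbers are illustrative only and are not results from the paper.

```python
from typing import Sequence


def throughput(smt_ipcs: Sequence[float]) -> float:
    """Total committed instructions per cycle across all threads."""
    return sum(smt_ipcs)


def hmean_weighted_ipc(smt_ipcs: Sequence[float],
                       alone_ipcs: Sequence[float]) -> float:
    """Harmonic mean of weighted IPCs (assumed definition).

    weighted_i = IPC of thread i in the SMT mix / IPC of thread i alone
    result     = N / sum(1 / weighted_i)
    """
    if not smt_ipcs or len(smt_ipcs) != len(alone_ipcs):
        raise ValueError("need one single-thread IPC per SMT thread")
    if any(s <= 0 for s in smt_ipcs) or any(a <= 0 for a in alone_ipcs):
        raise ValueError("IPC values must be positive")
    weighted = [s / a for s, a in zip(smt_ipcs, alone_ipcs)]
    return len(weighted) / sum(1.0 / w for w in weighted)


# Made-up two-thread mix: thread 0 keeps half of its standalone IPC,
# thread 1 keeps a quarter.
example_smt = [1.0, 0.5]
example_alone = [2.0, 2.0]
```

The harmonic mean penalizes mixes in which one thread is starved, which is why it is often reported alongside raw throughput as a fairness-aware figure.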


