ABSTRACT
The register file is one of the most critical datapath components limiting the number of threads that can be supported on a Simultaneous Multithreading (SMT) processor. To allow the use of smaller register files without degrading performance, registers must be used more efficiently, which can be achieved through aggressive register allocation and deallocation. In this paper, we propose a novel technique for the early deallocation of physical registers allocated to threads that experience L2 cache misses. This is accomplished by speculatively committing the load-independent instructions and deallocating the registers corresponding to the previous mappings of their destinations, without waiting for the cache miss request to be serviced. The early deallocated registers are then made immediately available for allocation to instructions within the same thread as well as within other threads, thus improving the overall processor throughput. On average across the simulated mixes of multiprogrammed SPEC 2000 workloads, our technique results in a 33% improvement in throughput and a 25% improvement in the harmonic mean of weighted IPCs over the baseline SMT with the state-of-the-art DCRA policy. This is achieved without creating checkpoints, maintaining per-register counters of pending consumers, or performing tag re-broadcasts, register re-mappings, or additional associative searches.
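
The two metrics quoted above can be illustrated with a minimal sketch. It assumes the commonly used definitions, which the abstract itself does not spell out: throughput is the sum of per-thread IPCs in SMT mode, and a thread's weighted IPC is its SMT-mode IPC divided by its IPC when running alone. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper's results.

```python
from typing import Sequence


def throughput(smt_ipcs: Sequence[float]) -> float:
    """Overall throughput, assumed here to be the sum of per-thread IPCs."""
    return sum(smt_ipcs)


def harmonic_mean_weighted_ipc(
    smt_ipcs: Sequence[float], alone_ipcs: Sequence[float]
) -> float:
    """Harmonic mean of weighted IPCs (a fairness-aware metric).

    weighted_ipc[i] = smt_ipcs[i] / alone_ipcs[i]   (assumed definition)
    result          = N / sum(1 / weighted_ipc[i])
    """
    if not smt_ipcs or len(smt_ipcs) != len(alone_ipcs):
        raise ValueError("need one SMT IPC and one single-thread IPC per thread")
    if any(ipc <= 0 for ipc in list(smt_ipcs) + list(alone_ipcs)):
        raise ValueError("all IPC values must be positive")
    weighted = [smt / alone for smt, alone in zip(smt_ipcs, alone_ipcs)]
    return len(weighted) / sum(1.0 / w for w in weighted)


# Hypothetical two-thread mix: each thread reaches IPC 2.0 when run alone.
smt = [1.0, 0.5]
alone = [2.0, 2.0]
total = throughput(smt)
fairness = harmonic_mean_weighted_ipc(smt, alone)
```

Because the harmonic mean is dominated by the smallest weighted IPC, a policy that raises throughput by starving one thread scores poorly on it, which is why the two figures are reported side by side.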