ABSTRACT
Technology scaling trends have forced designers to consider alternatives to deeply pipelining aggressive cores with large amounts of performance-accelerating hardware. One alternative is a small, simple core that can be augmented with latency-tolerant helpers. As the demands placed on the processor core vary between applications, and even between phases of an application, the benefit seen from any set of helpers will vary tremendously. If there is a single core, these auxiliary structures can be turned on and off dynamically to tune the energy/performance of the machine to the needs of the running application. As more of the processor is broken down into helpers, and additional cores that can potentially share helpers are added to a single chip, the decisions made about these structures become increasingly important. In this paper we describe the need for methods that effectively manage these helpers. Our counter-based approach can dynamically turn off three helpers on average while staying within 2% of the performance achieved with all helpers enabled. In a multicore environment, our intelligent and flexible sharing of helpers provides an average 24% speedup compared to static sharing in conjoined cores. Furthermore, we show a benefit from constructively sharing helpers among multiple cores running the same application.