ABSTRACT
High-performance superscalar architectures that exploit instruction-level parallelism in single-thread applications have become too complex and too power-hungry for the many-core processor era. We propose a new architecture that uses multiple latency-tolerant in-order cores to improve single-thread performance without requiring complex out-of-order execution hardware or large, power-hungry register files and instruction buffers. Because simple cores provide the improved single-thread performance for conventional difficult-to-parallelize applications, designers can place many more of these cores on the same die. Consequently, emerging highly parallel applications can take full advantage of the many-core parallel hardware without sacrificing the performance of inherently serial applications. Our architecture splits single-thread program execution into disjoint control and data threads that execute concurrently on multiple latency-tolerant in-order cores; hence we call this style of execution Disjoint Out-of-Order Execution (DOE). DOE is a novel implementation of Speculative Multithreading (SpMT) that uses latency tolerance to overcome the performance problems SpMT suffers from load imbalance and inter-thread data communication delays. Using control independence prediction hardware to spawn threads, we simulate the potential performance of DOE on a subset of the SPEC2000 integer benchmarks under various parallelism scenarios and for DOE configurations of 2, 4, 6, and 8 single-issue latency-tolerant cores.