ABSTRACT
The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instruction-level parallelism. Meanwhile, such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identifies chip multiprocessing (CMP) as potentially the most promising approach for designing processors targeted at commercial servers.
This paper describes the Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip. Piranha also integrates further on-chip functionality to allow for scalable multiprocessor configurations to be built in a glueless and modular fashion. The use of simple processor cores combined with an industry-standard ASIC design methodology allows us to complete our prototype within a short time-frame, with a team size and investment that are an order of magnitude smaller than those of a commercial microprocessor. Our detailed simulation results show that while each Piranha processor core is substantially slower than an aggressive next-generation processor, the integration of eight cores onto a single chip allows Piranha to outperform next-generation processors by up to 2.9 times (on a per-chip basis) on important workloads such as OLTP. This performance advantage can approach a factor of five by using full-custom instead of ASIC logic. In addition to exploiting chip multiprocessing, the Piranha prototype incorporates several other unique design choices including a shared second-level cache with no inclusion, a highly optimized cache coherence protocol, and a novel I/O architecture.