ABSTRACT
This paper is motivated by three recent trends in computer design. First, chip multi-processors (CMPs) with increasing numbers of CPU cores per chip are becoming common. Second, multi-threaded software that can take advantage of CMPs will soon become prevalent. Due to the nature of the algorithms, these multi-threaded programs inherently will have phases of sequential execution; Amdahl's law dictates that the speedup of such parallel programs will be limited by the sequential portion of the computation. Finally, increasing levels of on-chip integration coupled with a slowing rate of reduction in supply voltage make power consumption a first-order design constraint. Given this environment, our goal is to minimize the execution times of multi-threaded programs containing nontrivial parallel and sequential phases, while keeping the CMP's total power consumption within a fixed budget. In order to mitigate the effects of Amdahl's law, in this paper we make a compelling case for varying the amount of energy expended to process instructions according to the amount of available parallelism. Using the equation Power = Energy per instruction (EPI) × Instructions per second (IPS), we propose that during phases of limited parallelism (low IPS) the chip multi-processor will spend more EPI; similarly, during phases of higher parallelism (high IPS) the chip multi-processor will spend less EPI; in both scenarios power is fixed. We evaluate the performance benefits of an EPI throttle on an asymmetric multiprocessor (AMP) prototyped from a physical 4-way Xeon SMP server. Using a wide range of multi-threaded programs, we show a 38% wall clock speedup on an AMP compared to a standard SMP that uses the same power. We also measure the supply current on a 4-way SMP server while running the multi-threaded programs and use the measured data as input to a software simulator that implements a more flexible EPI throttle. The results from the measurement-driven simulation show performance benefits comparable to the AMP prototype. We analyze the results from both techniques, explain why and when an EPI throttle works well, and conclude with a discussion of the challenges in building practical EPI throttles.
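The relation Power = EPI × IPS can be illustrated with a small sketch. The function name `epi_budget`, the 100 W budget, and the two IPS figures below are hypothetical values chosen for illustration; they are not measurements from the paper, and the sketch shows only the arithmetic of the trade-off, not how an EPI throttle would be built.

```python
def epi_budget(power_watts: float, ips: float) -> float:
    """Return the energy per instruction (joules) that holds total power
    at power_watts for a given instruction rate, using Power = EPI * IPS."""
    if ips <= 0:
        raise ValueError("ips must be positive")
    return power_watts / ips


# Hypothetical numbers for illustration only (not from the paper).
POWER_BUDGET_W = 100.0   # fixed chip power budget
sequential_ips = 2e9     # low IPS: limited parallelism
parallel_ips = 8e9       # high IPS: ample parallelism

seq_epi = epi_budget(POWER_BUDGET_W, sequential_ips)
par_epi = epi_budget(POWER_BUDGET_W, parallel_ips)

# EPI is reported in nanojoules per instruction for readability.
print(f"sequential phase: {seq_epi * 1e9:.1f} nJ/instruction")
print(f"parallel phase:   {par_epi * 1e9:.1f} nJ/instruction")
```

Under these assumed numbers, the sequential phase can spend four times as much energy per instruction as the parallel phase while the product EPI × IPS stays at the same power budget.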