ABSTRACT
Multiprocessor Systems on Chips (MPSoCs) have become a popular architectural technique to increase performance. However, MPSoCs may lead to undesirable power consumption characteristics for computing systems that have strict power budgets, such as PDAs, mobile phones, and notebook computers. This paper presents the super-complex instruction-set computing (SuperCISC) Embedded Processor Architecture and, in particular, investigates the performance and power consumption of this device compared to execution on traditional processor architectures. SuperCISC is a heterogeneous, multicore processor architecture designed to exceed the performance of traditional embedded processors while maintaining a reduced power budget compared to low-power embedded processors. At the heart of the SuperCISC processor is a multicore VLIW (Very Long Instruction Word) containing several homogeneous execution cores/functional units. In addition, complex and heterogeneous combinational hardware function cores are tightly integrated with the core VLIW engine, providing an opportunity for improved performance and reduced energy consumption. Our SuperCISC processor core has been synthesized for both a 90-nm Stratix II Field Programmable Gate Array (FPGA) and a 160-nm standard cell Application-Specific Integrated Circuit (ASIC) fabrication process from OKI, each operating at approximately 167 MHz for the VLIW core. We examine several reasons for speedup and power improvement through the SuperCISC architecture, including predicated control flow, cycle compression, and a reduction in arithmetic power consumption, which we call power compression. Finally, testing our SuperCISC processor with multimedia and signal-processing benchmarks, we show how the SuperCISC processor can provide performance improvements ranging from 7X to 160X with an average of 60X, while also providing orders-of-magnitude power improvements for the computational kernels.
The power improvements for our benchmark kernels range from just over 40X to over 400X, with an average savings exceeding 130X. By combining these power and performance improvements, our total energy improvements all exceed 1000X. As these savings are limited to the computational kernels of the applications, which often consume approximately 90% of the execution time, we expect our savings to approach the ideal application improvement of 10X.
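The arithmetic behind the last two claims can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' evaluation method: it assumes that energy is average power multiplied by execution time (so a kernel's energy improvement is the product of its power and performance improvements), and that the application-level bound follows Amdahl's law with the kernel taking 90% of the baseline execution time. The function names `kernel_energy_improvement` and `amdahl_speedup` are hypothetical.

```python
def kernel_energy_improvement(speedup: float, power_improvement: float) -> float:
    """Energy improvement of a kernel, assuming energy = average power x time."""
    if speedup <= 0 or power_improvement <= 0:
        raise ValueError("improvement factors must be positive")
    return speedup * power_improvement


def amdahl_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Whole-application speedup when only the kernel is accelerated (Amdahl's law)."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    # The non-kernel share runs at its original speed; the kernel share shrinks.
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Average figures quoted in the abstract: 60X performance, 130X power.
    print(kernel_energy_improvement(60, 130))
    # Application-level speedup for the quoted range of kernel speedups,
    # assuming the kernel is 90% of the baseline execution time.
    for s in (7, 60, 160):
        print(s, round(amdahl_speedup(0.9, s), 2))
```

However large the kernel speedup becomes, the unaccelerated 10% of the execution time caps the application-level speedup below 1 / (1 - 0.9) = 10X, which is the "ideal application improvement" the abstract refers to.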