ABSTRACT
Today, embedded processors are expected to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, the impact of memory latencies on execution time is becoming increasingly evident. At the same time, embedded processors are expected to be power-conscious and to occupy minimal area. Traditional methods for addressing performance and memory latencies, such as multiple issue, out-of-order execution, and large associative caches, are therefore poorly suited to the embedded domain because of their significant area and power overhead. This paper explores a novel approach to mitigating execution delays caused by memory latencies, delays that a regular in-order, single-issue embedded processor could not otherwise hide without large, power-hungry constructs like a Reorder Buffer (ROB). The concept relies on both compile-time and run-time information to safely allow non-data-dependent instructions to continue executing while a memory stall is in progress. Simulation results show a significant improvement in execution throughput of approximately 11%, with minimal impact on area and power.
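The general idea of letting non-data-dependent instructions proceed under a memory stall can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions, not the mechanism proposed in the paper: the names `Instr`, `independent_run`, `blocking_cycles`, and `stall_tolerant_cycles` are hypothetical, every instruction is assumed to take one cycle, only one miss may be outstanding at a time, and the "compile-time" step is reduced to counting the instructions after a load that neither read nor overwrite its destination register.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    dest: Optional[str]            # destination register (None for stores/branches)
    srcs: Tuple[str, ...] = ()     # source registers
    is_load: bool = False
    misses: bool = False           # run-time information: this load misses in the cache


def independent_run(program: List[Instr], i: int) -> int:
    """Assumed compile-time step: count the instructions following load i that
    can safely execute before the load returns. The run ends at the first
    instruction that reads or overwrites the load's destination, or at the next
    load (this toy model allows only one outstanding miss)."""
    load_dest = program[i].dest
    count = 0
    for ins in program[i + 1:]:
        if ins.is_load or load_dest in ins.srcs or ins.dest == load_dest:
            break
        count += 1
    return count


def blocking_cycles(program: List[Instr], miss_penalty: int) -> int:
    """Baseline in-order core: every miss stalls the whole pipeline."""
    return sum(1 + (miss_penalty if ins.is_load and ins.misses else 0)
               for ins in program)


def stall_tolerant_cycles(program: List[Instr], miss_penalty: int) -> int:
    """Same core, but independent instructions overlap with the miss.
    Each miss hides min(independent run length, miss penalty) cycles."""
    hidden = 0
    for i, ins in enumerate(program):
        if ins.is_load and ins.misses:
            hidden += min(independent_run(program, i), miss_penalty)
    return blocking_cycles(program, miss_penalty) - hidden


program = [
    Instr("r1", ("r0",), is_load=True, misses=True),  # load r1 <- [r0], cache miss
    Instr("r2", ("r3", "r4")),                        # independent of r1
    Instr("r5", ("r2",)),                             # independent of r1
    Instr("r6", ("r1", "r5")),                        # needs r1: must wait for the load
]

print(blocking_cycles(program, miss_penalty=10))        # 14
print(stall_tolerant_cycles(program, miss_penalty=10))  # 12
```

In this example two of the ten stall cycles are hidden because two instructions after the load do not touch `r1`. The figures come from the toy model only and say nothing about the 11% throughput improvement reported above.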