Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions
Source
International Conference on Compilers, Architecture and Synthesis for Embedded Systems
Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Grenoble, France
SESSION: Architectural optimizations
Pages 117-126  
Year of Publication: 2009
ISBN:978-1-60558-626-7
Authors
Garo Bournoutian  University of California, San Diego, La Jolla, CA, USA
Alex Orailoglu  University of California, San Diego, La Jolla, CA, USA
Sponsors
SIGDA: ACM Special Interest Group on Design Automation
ACM: Association for Computing Machinery
SIGBED: ACM Special Interest Group on Embedded Systems
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1629395.1629413

ABSTRACT

Today, embedded processors are expected to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, the impact of memory latencies on execution time is becoming increasingly evident. At the same time, embedded processors are expected to be power-conscious and to occupy minimal area. Traditional methods for addressing performance and memory latencies, such as multiple issue, out-of-order execution, and large, associative caches, are therefore poorly suited to the embedded domain because of their significant area and power overhead. This paper explores a novel approach to mitigating execution delays caused by memory latencies, one that would otherwise not be possible in a regular in-order, single-issue embedded processor without large, power-hungry constructs such as a Reorder Buffer (ROB). The concept relies on both compile-time and run-time information to safely allow non-data-dependent instructions to continue executing while a memory stall is in progress. Simulation results show an improvement in execution throughput of approximately 11%, with minimal impact on area and power.
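
As a rough illustration of the compile-time side of this idea, the sketch below scans the instructions that follow a load and marks those that could safely proceed while the load is outstanding. It is not the authors' algorithm, which the abstract does not detail. The `Instr` type, `BARRIER_OPS`, and `independent_after_load` are hypothetical names, and the conservative choice to stop at any later memory operation or branch is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical three-address instruction used only for this sketch."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


# Assumption: stop the scan at any later memory operation or branch.
BARRIER_OPS = {"load", "store", "branch"}


def independent_after_load(block: List[Instr], load_idx: int) -> List[int]:
    """Return indices of instructions after block[load_idx] that do not
    depend on the load and do not conflict with deferred instructions."""
    tainted = {block[load_idx].dest}  # registers still waiting on the load
    pending_reads = set()             # registers deferred instructions must still read
    independent = []
    for i in range(load_idx + 1, len(block)):
        ins = block[i]
        if ins.op in BARRIER_OPS:
            break
        raw = any(s in tainted for s in ins.srcs)  # reads a value not yet produced
        waw = ins.dest in tainted                  # would be overwritten out of order
        war = ins.dest in pending_reads            # would clobber a deferred operand
        if raw or waw or war:
            # The instruction must wait, so its result is also unavailable.
            if ins.dest is not None:
                tainted.add(ins.dest)
            pending_reads.update(ins.srcs)
        else:
            independent.append(i)
    return independent


block = [
    Instr("load", "r1", ("r2",)),       # 0: the load that misses in the cache
    Instr("add", "r3", ("r1", "r4")),   # 1: reads r1, must wait
    Instr("mul", "r5", ("r6", "r7")),   # 2: independent
    Instr("sub", "r4", ("r5", "r6")),   # 3: would overwrite r4 before 1 reads it
    Instr("add", "r8", ("r3", "r5")),   # 4: reads r3, must wait
    Instr("or", "r9", ("r6", "r7")),    # 5: independent
    Instr("store", None, ("r9", "r2")), # 6: barrier, scan stops here
    Instr("add", "r10", ("r6", "r6")),  # 7: not examined
]
print(independent_after_load(block, 0))
```

In this example only instructions 2 and 5 are marked. The check covers write-after-read and write-after-write conflicts as well as direct data dependence, because an instruction that runs ahead of a deferred one must not disturb the registers that the deferred instruction reads or writes.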

