|
ABSTRACT
Effective use of the memory hierarchy is critical for achieving high performance on embedded systems. We focus on the class of streaming applications, which is increasingly prevalent in the embedded domain. We exploit the widespread parallelism and regular communication patterns in stream programs to formulate a set of cache aware optimizations that automatically improve instruction and data locality. Our work is in the context of the Synchronous Dataflow model, in which a program is described as a graph of independent actors that communicate over channels. The communication rates between actors are known at compile time, allowing the compiler to statically model the caching behavior.We present three cache aware optimizations: 1) execution scaling, which judiciously repeats actor executions to improve instruction locality, 2) cache aware fusion, which combines adjacent actors while respecting instruction cache constraints, and 3) scalar replacement, which converts certain data buffers into a sequence of scalar variables that can be register allocated. The optimizations are founded upon a simple and intuitive model that quantifies the temporal locality for a sequence of actor executions. Our implementation of cache aware optimizations in the StreamIt compiler yields a 249% average speedup (over unoptimized code) for our streaming benchmark suite on a StrongARM 1110 processor. The optimizations also yield a 154% speedup on a Pentium 3 and a 152% speedup on an Itanium 2.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Albert Benveniste , Paul Caspi , Paul Le Guernic , Nicolas Halbwachs, Data-Flow Synchronous Languages, A Decade of Concurrency, Reflections and Perspectives, REX School/Symposium, p.1-45, June 01-04, 1993
|
| |
2
|
|
| |
3
|
C. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. APGAN and RPMC: Complementary Heuristics for Translating DSP Block Diagrams into Efficient Software Implementations. Journal of Design Automation for Embedded Systems., pages 33--60, January 1997.
|
| |
4
|
|
| |
5
|
|
| |
6
|
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. In SIGGRAPH, 2004.
|
| |
7
|
R. Govindarajan, G. Gao, and P. Desai. Minimizing memory requirements in rate-optimal schedules. In Proceedings of the 1994 Int. conference on Application Specific Array Processors, pages 75--86, August 1994.
|
| |
8
|
N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data-flow programming language LUSTRE. Proc. of the IEEE, 79(9), 1991.
|
| |
9
|
Jean-Luc Gaudiot , Tom DeBoni , John Feo , Wim Böhm , Walid Najjar , Patrick Miller, The Sisal Model of Functional Programming and its Implementation, Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis, p.112, March 17-21, 1997
|
 |
10
|
Michal Karczmarek , William Thies , Saman Amarasinghe, Phased scheduling of stream programs, Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems, June 11-13, 2003, San Diego, California, USA
|
| |
11
|
S. Kohli. Cache aware scheduling of synchronous dataflow programs. Master's Report Technical Memorandum UCB/URL M04/03, UC Berkeley, 2004.
|
| |
12
|
E. A. Lee. Overview of the Ptolemy Project. Technical report, Tech Memo UCB/ERL M03/25, UC Berkeley, 2003.
|
| |
13
|
|
| |
14
|
R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 1970.
|
| |
15
|
|
| |
16
|
P. K. Murthy and S. S. Bhattacharyya. Buffer Merging --- A Powerful Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications. Technical report, Inst. for Adv. Computer Studies, UMD College Park, 2000.
|
 |
17
|
|
| |
18
|
J. Sermulins. In preparation. Master's thesis, MIT CSAIL, 2005.
|
| |
19
|
R. Stephens. A Survey of Stream Processing. Acta Informatica, 34(7), 1997.
|
| |
20
|
D. Tennenhouse and V. Bose. The SpectrumWare Approach to Wireless Signal Processing. Wireless Networks, 1999.
|
| |
21
|
|
CITED BY 5
|
|
|
|
|
Arno Moonen , Marco Bekooij , René van den Berg , Jef van Meerbergen, Cache aware mapping of streaming applications on a multiprocessor system-on-chip, Proceedings of the conference on Design, automation and test in Europe, March 10-14, 2008, Munich, Germany
|
|
|
|
|
|
|
|
|
Andreas Dahlin , Johan Ersfolk , Guyfu Yang , Haitham Habli , Johan Lilius, The canals language and its compiler, Proceedings of th 12th International Workshop on Software and Compilers for Embedded Systems, April 23-24, 2009, Nice, France
|
INDEX TERMS
Primary Classification:
D.
Software
D.3
PROGRAMMING LANGUAGES
D.3.4
Processors
Subjects:
Optimization
Additional Classification:
D.
Software
D.3
PROGRAMMING LANGUAGES
D.3.2
Language Classifications
Subjects:
Concurrent, distributed, and parallel languages;
Data-flow languages
D.3.4
Processors
Subjects:
Compilers;
Code generation
General Terms:
Design,
Languages,
Performance
Keywords:
StreamIt,
cache,
cache optimizations,
embedded,
fusion,
stream programing,
synchronous dataflow
|