Optimizing inter-processor data locality on embedded chip multiprocessors
Source International Conference On Embedded Software
Proceedings of the 5th ACM international conference on Embedded software
Jersey City, NJ, USA
SESSION: Clocks and energy
Pages: 227 - 236  
Year of Publication: 2005
ISBN:1-59593-091-4
Authors
G. Chen  Pennsylvania State University, University Park, PA
M. Kandemir  Pennsylvania State University, University Park, PA
Sponsors
SIGBED: ACM Special Interest Group on Embedded Systems
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1086228.1086271

ABSTRACT

Recent research in embedded computing indicates that packing multiple processor cores on the same die is an effective way of utilizing the ever-increasing number of transistors. The advantage of placing multiple cores on a single die is that it reduces the communication costs between the processor cores (in terms of both execution cycles and power consumption), which are traditionally very high in conventional high-performance parallel architectures (such as SMPs). On the negative side, however, this tighter integration puts even higher pressure on off-chip accesses to the memory system. This makes minimizing the number of off-chip accesses a critical optimization goal.

This paper discusses a compiler-based solution to this problem for embedded applications that perform stencil computations. An important characteristic of this solution is that it distinguishes between intra-processor data reuse and inter-processor data reuse. The first captures the data reuse that occurs across loop iterations assigned to the same processor, whereas the second represents the data reuse that takes place across loop iterations assigned to different processors. The proposed approach optimizes inter-processor reuse by carefully reorganizing the loop iterations of each processor, considering how data elements are shared across processors. The goal is to ensure that the different processors access the shared data within a short period of time, so that the data can still be captured in the on-chip memory space at the time of the reuse.

This paper also presents an evaluation of the proposed optimization and compares it to an alternate scheme that optimizes data locality for each processor in isolation. The results obtained by applying our implementation to eight loop-intensive benchmark codes from the embedded computing domain show that our approach improves over the alternate scheme by 15.6% on average.
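
The idea of inter-processor reuse can be illustrated with a small sketch. This is not the paper's algorithm. It is a toy model under several assumptions of ours: a 1D three-point stencil, two processors that execute one iteration per time step in lockstep, and a contiguous block partition. All function names here are hypothetical. The sketch measures how far apart in time the two processors touch the array elements they share, first with both processors iterating in ascending order and then with the first processor's order reversed so that both start at the shared boundary.

```python
def block_partition(n, p):
    """Split iterations 1..n-2 of a 3-point stencil into p contiguous blocks."""
    iters = list(range(1, n - 1))
    size = -(-len(iters) // p)  # ceiling division
    return [iters[k * size:(k + 1) * size] for k in range(p)]


def touched(i):
    """Array elements read by iteration i of b[i] = a[i-1] + a[i] + a[i+1]."""
    return (i - 1, i, i + 1)


def access_times(schedule):
    """Map element -> {processor: [time steps]}, assuming lockstep execution."""
    times = {}
    for proc, order in enumerate(schedule):
        for step, i in enumerate(order):
            for e in touched(i):
                times.setdefault(e, {}).setdefault(proc, []).append(step)
    return times


def max_inter_processor_gap(schedule):
    """Worst case, over shared elements, of the closest pair of accesses
    made by two different processors (measured in time steps)."""
    worst = 0
    for per_proc in access_times(schedule).values():
        procs = sorted(per_proc)
        for a in range(len(procs)):
            for b in range(a + 1, len(procs)):
                gap = min(abs(s - t)
                          for s in per_proc[procs[a]]
                          for t in per_proc[procs[b]])
                worst = max(worst, gap)
    return worst


n, p = 18, 2
blocks = block_partition(n, p)

# Both processors sweep their block in ascending order.
naive = blocks
# Assumed reordering: processor 0 sweeps toward index 0, starting at the
# boundary it shares with processor 1.
reordered = [list(reversed(blocks[0])), blocks[1]]

print(max_inter_processor_gap(naive))      # 6
print(max_inter_processor_gap(reordered))  # 0
```

In the ascending schedule the shared boundary elements are touched last by one processor and first by the other, so a small on-chip memory may have evicted them in between. In the reordered schedule both processors touch them in the same time step. With more than two processors a simple reversal no longer aligns every boundary, which is part of why a more careful reorganization is needed.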

