ACM Home Page
Please provide us with feedback. Feedback
Cache-aware partitioning of multi-dimensional iteration spaces
Full text PdfPdf (722 KB)
Source ACM International Conference Proceeding Series archive
Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference table of contents
Haifa, Israel
SESSION: Performance optimization and testing table of contents
Article No. 15  
Year of Publication: 2009
ISBN:978-1-60558-623-6
Authors
Arun Kejariwal  Yahoo! Inc., Santa Clara, CA
Alexandru Nicolau  University of California, Irvine, CA
Utpal Banerjee  University of California, Irvine, CA
Alexander V. Veidenbaum  University of California, Irvine, CA
Constantine D. Polychronopoulos  University of Illinois at Urbana-Champaign, Urbana, IL
Sponsors
: Melanox Technologies
: Hebrew University of Jerusalem
IBM : IBM
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 56,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1534530.1534551
What is a DOI?

ABSTRACT

The need for high performance per watt has led to development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores which in turn has a direct impact on performance. In light of the fact that, parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks achieving speedups upto 62.5%.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
Teraflops Research Chip. http://www.intel.com/research/platform/terascale/teraflops.htm.
3
 
4
S. F. Lundstrom and G. H. Barnes. A controllable MIMD architectures. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19--27, St. Charles, IL, August 1980.
 
5
SPEC CFP2000. http://www.spec.org/cpu2000/CFP2000.
6
 
7
R. Sakellariou. On the Quest for Perfect Load Balance in Loop-Based Parallel Computations. PhD thesis, Department of Computer Science, University of Manchester, October 1996.
 
8
C. Polychronopoulos, D. J. Kuck, and D. A. Padua. Execution of parallel loops on parallel processor systems. In Proceedings of the 1986 International Conference on Parallel Processing, pages 519--527, August 1986.
 
9
10
11
 
12
A. Kejariwal, P. D'Alberto, A. Nicolau, and C. D. Polychronopoulos. A geometric approach for partitioning N-dimensional non-rectangular iteration spaces. In Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing, pages 102--116, West Lafayette, IN, 2004.
 
13
 
14
15
 
16
17
 
18
 
19
C. Polychronopoulos. Loop coalescing: A compiler transformation for parallel machines. In Proceedings of the 1987 International Conference on Parallel Processing, pages 235--242, August 1987.
 
20
SPEC CINT2006. http://www.spec.org/cpu2006/CINT2006.
 
21
SPEC CFP2006. http://www.spec.org/cpu2006/CFP2006.
 
22
Intel® VTune#8482; Performance Analyzer 8.0.1 for Windows. http://www.intel.com/cd/software/products/asmo-na/eng/vtune/219898.htm.
 
23
 
24
OpenMP Specification, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.
25
 
26
E. W. Weisstein. Abel's impossibility theorem. from mathworld--a wolfram web resource. http://mathworld.wolfram.com/AbelsImpossibilityTheorem.html.
27
 
28
M. J. Wolfe. Iteration space tiling for memory hierarchies, December 1987.
29
 
30
 
31
32
33
 
34
Intel® Compilers for Linux. http://www.intel.com/cd/software/products/asmo-na/eng/compilers/284264.htm.
 
35
D. Kuck, A. H. Sameh, R. Cytron, A. Veidenbaum, C. D. Polychronopoulos, G. Lee, T. McDaniel, B. R. Leasure, C. Beckman, J. R. B Davies, and C. P. Kruskal. The effects of program restructuring, algorithm change and architecture choice on program performance. In Proceedings of the 1984 International Conference on Parallel Processing, pages 129--138, August 1984.
 
36
37
 
38
M. O'Boyle and G. A. Hedayat. Program and data transformations for efficient execution on distributed memory architectures. Technical Report UMCS-93-1-6, Department of Computer Science, University of Manchester, 1992.
 
39
J. Sheu and T. Thai. Partitioning and mapping nested for-loops on multiprocessor systems. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.
 
40
 
41
42
43
 
44
E. Pegg., T. Rowland, and E. W. Weisstein. Cayley graph. from mathworld--a wolfram web resource. http://mathworld.wolfram.com/CayleyGraph.html.
 
45
46
47
48
 
49
F. Schneider and T. Gross. Using platform-specific performance counters for dynamic compilation. In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, Hawthorne, NY, October 2005.
50
 
51

Collaborative Colleagues:
Arun Kejariwal: colleagues
Alexandru Nicolau: colleagues
Utpal Banerjee: colleagues
Alexander V. Veidenbaum: colleagues
Constantine D. Polychronopoulos: colleagues