|
ABSTRACT
The need for high performance per watt has led to development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores which in turn has a direct impact on performance. In light of the fact that, parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks achieving speedups upto 62.5%.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Teraflops Research Chip. http://www.intel.com/research/platform/terascale/teraflops.htm.
|
 |
3
|
|
| |
4
|
S. F. Lundstrom and G. H. Barnes. A controllable MIMD architectures. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19--27, St. Charles, IL, August 1980.
|
| |
5
|
SPEC CFP2000. http://www.spec.org/cpu2000/CFP2000.
|
 |
6
|
|
| |
7
|
R. Sakellariou. On the Quest for Perfect Load Balance in Loop-Based Parallel Computations. PhD thesis, Department of Computer Science, University of Manchester, October 1996.
|
| |
8
|
C. Polychronopoulos, D. J. Kuck, and D. A. Padua. Execution of parallel loops on parallel processor systems. In Proceedings of the 1986 International Conference on Parallel Processing, pages 519--527, August 1986.
|
| |
9
|
|
 |
10
|
Arun Kejariwal , Alexandru Nicolau , Hideki Saito , Xinmin Tian , Milind Girkar , Utpal Banerjee , Constantine D. Polychronopoulos, A general approach for partitioning N-dimensional parallel nested loops with conditionals, Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures, July 30-August 02, 2006, Cambridge, Massachusetts, USA
[doi> 10.1145/1148109.1148117]
|
 |
11
|
Arun Kejariwal , Alexandru Nicolau , Utpal Banerjee , Constantine D. Polychronopoulos, A novel approach for partitioning iteration spaces with variable densities, Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, June 15-17, 2005, Chicago, IL, USA
[doi> 10.1145/1065944.1065962]
|
| |
12
|
A. Kejariwal, P. D'Alberto, A. Nicolau, and C. D. Polychronopoulos. A geometric approach for partitioning N-dimensional non-rectangular iteration spaces. In Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing, pages 102--116, West Lafayette, IN, 2004.
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
|
 |
17
|
Siddhartha Chatterjee , Erin Parker , Philip J. Hanlon , Alvin R. Lebeck, Exact analysis of the cache behavior of nested loops, Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, p.286-297, June 2001, Snowbird, Utah, United States
|
| |
18
|
|
| |
19
|
C. Polychronopoulos. Loop coalescing: A compiler transformation for parallel machines. In Proceedings of the 1987 International Conference on Parallel Processing, pages 235--242, August 1987.
|
| |
20
|
SPEC CINT2006. http://www.spec.org/cpu2006/CINT2006.
|
| |
21
|
SPEC CFP2006. http://www.spec.org/cpu2006/CFP2006.
|
| |
22
|
Intel® VTune#8482; Performance Analyzer 8.0.1 for Windows. http://www.intel.com/cd/software/products/asmo-na/eng/vtune/219898.htm.
|
| |
23
|
|
| |
24
|
OpenMP Specification, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.
|
 |
25
|
|
| |
26
|
E. W. Weisstein. Abel's impossibility theorem. from mathworld--a wolfram web resource. http://mathworld.wolfram.com/AbelsImpossibilityTheorem.html.
|
 |
27
|
|
| |
28
|
M. J. Wolfe. Iteration space tiling for memory hierarchies, December 1987.
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
 |
32
|
|
 |
33
|
|
| |
34
|
Intel® Compilers for Linux. http://www.intel.com/cd/software/products/asmo-na/eng/compilers/284264.htm.
|
| |
35
|
D. Kuck, A. H. Sameh, R. Cytron, A. Veidenbaum, C. D. Polychronopoulos, G. Lee, T. McDaniel, B. R. Leasure, C. Beckman, J. R. B Davies, and C. P. Kruskal. The effects of program restructuring, algorithm change and architecture choice on program performance. In Proceedings of the 1984 International Conference on Parallel Processing, pages 129--138, August 1984.
|
| |
36
|
|
 |
37
|
D. Kulkarni , K. G. Kumar , A. Basu , A. Paulraj, Loop partitioning for distributed memory multiprocessors as unimodular transformations, Proceedings of the 5th international conference on Supercomputing, p.206-215, June 17-21, 1991, Cologne, West Germany
[doi> 10.1145/109025.109079]
|
| |
38
|
M. O'Boyle and G. A. Hedayat. Program and data transformations for efficient execution on distributed memory architectures. Technical Report UMCS-93-1-6, Department of Computer Science, University of Manchester, 1992.
|
| |
39
|
J. Sheu and T. Thai. Partitioning and mapping nested for-loops on multiprocessor systems. In Proceedings of the 1991 ACM International Conference on Supercomputing, Cologne, Germany, June 1991.
|
| |
40
|
|
| |
41
|
|
 |
42
|
|
 |
43
|
|
| |
44
|
E. Pegg., T. Rowland, and E. W. Weisstein. Cayley graph. from mathworld--a wolfram web resource. http://mathworld.wolfram.com/CayleyGraph.html.
|
| |
45
|
|
 |
46
|
O. Temam , C. Fricker , W. Jalby, Cache interference phenomena, Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems, p.261-271, May 16-20, 1994, Nashville, Tennessee, United States
|
 |
47
|
|
 |
48
|
Glenn Ammons , Thomas Ball , James R. Larus, Exploiting hardware performance counters with flow and context sensitive profiling, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.85-96, June 16-18, 1997, Las Vegas, Nevada, United States
|
| |
49
|
F. Schneider and T. Gross. Using platform-specific performance counters for dynamic compilation. In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, Hawthorne, NY, October 2005.
|
 |
50
|
|
| |
51
|
|
|