|
ABSTRACT
The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It develops two predictors: a data sensitive and a data insensitive predictor to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. By using low-cost profiling runs, they predict the mapping for a new unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). Performance of our technique is stable across programs and architectures. On average, it delivers above 96% performance of the maximum available on both platforms. It achieve, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance with a significant lower profiling cost.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
D. H. Bailey, E. Barszcz, et al. The NAS parallel benchmarks. The International Journal of Supercomputer Applications, 5(3):63--73, 1991.
|
 |
2
|
Bradley J. Barnes , Barry Rountree , David K. Lowenthal , Jaxk Reeves , Bronis de Supinski , Martin Schulz, A regression-based approach to scalability prediction, Proceedings of the 22nd annual international conference on Supercomputing, June 07-12, 2008, Island of Kos, Greece
[doi> 10.1145/1375527.1375580]
|
 |
3
|
Bernhard E. Boser , Isabelle M. Guyon , Vladimir N. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the fifth annual workshop on Computational learning theory, p.144-152, July 27-29, 1992, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/130385.130401]
|
| |
4
|
|
| |
5
|
F. Blagojevic, X. Feng, et al. Modeling multi-grain parallelism on heterogeneous multicore processors: A case study of the Cell BE. In HiPEAC'08, 2008.
|
| |
6
|
John Cavazos , Grigori Fursin , Felix Agakov , Edwin Bonilla , Michael F. P. O'Boyle , Olivier Temam, Rapidly Selecting Good Compiler Optimizations using Performance Counters, Proceedings of the International Symposium on Code Generation and Optimization, p.185-197, March 11-14, 2007
[doi> 10.1109/CGO.2007.32]
|
 |
7
|
Keith D. Cooper , Philip J. Schielke , Devika Subramanian, Optimizing for reduced code space using genetic algorithms, Proceedings of the ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded systems, p.1-9, May 05-05, 1999, Atlanta, Georgia, United States
|
| |
8
|
|
| |
9
|
|
 |
10
|
David E. Culler , Richard M. Karp , David Patterson , Abhijit Sahay , Eunice E. Santos , Klaus Erik Schauser , Ramesh Subramonian , Thorsten von Eicken, LogP: a practical model of parallel computation, Communications of the ACM, v.39 n.11, p.78-85, Nov. 1996
[doi> 10.1145/240455.240477]
|
 |
11
|
|
| |
12
|
M. R. Guthaus, J. S. Ringenberg, et al. Mibench: A free, commercially representative embedded benchmark suite, 2001.
|
| |
13
|
H. Hofstee. Future microprocessors and off-chip SOP interconnect. Advanced Packaging, IEEE Transactions on, 27(2):301--303, May 2004.
|
 |
14
|
Ilya Sharapov , Robert Kroeger , Guy Delamarter , Razvan Cheveresan , Matthew Ramsay, A case study in top-down performance estimation for a large-scale parallel application, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, March 29-31, 2006, New York, New York, USA
[doi> 10.1145/1122971.1122985]
|
| |
15
|
E. Ipek, B. R. de Supinski, et al. An approach to performance prediction for parallel applications. In Euro-Par'05, 2005.
|
 |
16
|
|
| |
17
|
|
| |
18
|
C. Lee. UTDSP benchmark suite, http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html.
|
 |
19
|
|
| |
20
|
C. Liao and B. Chapman. A compile-time cost model for OpenMP. In IPDPS'07, 2007.
|
 |
21
|
|
| |
22
|
S. Long, G. Fursin, et al. A cost-aware parallel workload allocation approach based on machine learning. In NPC '07, 2007.
|
 |
23
|
Chi-Keung Luk , Robert Cohn , Robert Muth , Harish Patil , Artur Klauser , Geoff Lowney , Steven Wallace , Vijay Janapa Reddi , Kim Hazelwood, Pin: building customized program analysis tools with dynamic instrumentation, Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, June 12-15, 2005, Chicago, IL, USA
|
| |
24
|
|
 |
25
|
|
 |
26
|
|
| |
27
|
Xinmin Tian , Milind Girkar , Sanjiv Shah , Douglas Armstrong , Ernesto Su , Paul Petersen, Compiler and Runtime Support for Running OpenMP Programs on Pentium- and Itanium-Architectures, Proceedings of the 17th International Symposium on Parallel and Distributed Processing, p.130.1, April 22-26, 2003
|
| |
28
|
|
|