|
ABSTRACT
In this paper, we revisit the problem of performance prediction on shared memory parallel machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms.In our previous work, we have developed a number of techniques for parallelizing this class of reductions. Our previous work has shown that each of the three techniques, full replication, optimized full locking, and cache-sensitive, can outperform others depending upon problem, dataset, and machine parameters. Therefore, an important question is, "Can we predict the performance of these techniques for a given problem, dataset, and machine?".This paper addresses this question by developing an analytical performance model that captures a two-level cache, coherence cache misses, TLB misses, locking overheads, and contention for memory. Analytical model is combined with results from micro-benchmarking to predict performance on real machines. We have validated our model on two different SMP machines. Our results show that our model effectively captures the impact of memory hierarchy (two-level cache and TLB) as well as the factors that limit parallelism (contention for locks, memory contention, and coherence cache misses). The difference between predicted and measured performance is within 20% in almost all cases. Moreover, the model is quite accurate in predicting the relative performance of the three parallelization techniques.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
 |
3
|
Siddhartha Chatterjee , Erin Parker , Philip J. Hanlon , Alvin R. Lebeck, Exact analysis of the cache behavior of nested loops, Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, p.286-297, June 2001, Snowbird, Utah, United States
|
 |
4
|
David Culler , Richard Karp , David Patterson , Abhijit Sahay , Klaus Erik Schauser , Eunice Santos , Ramesh Subramonian , Thorsten von Eicken, LogP: towards a realistic model of parallel computation, Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.1-12, May 19-22, 1993, San Diego, California, United States
|
| |
5
|
|
 |
6
|
Matthew I. Frank , Anant Agarwal , Mary K. Vernon, LoPC: modeling contention in parallel algorithms, Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.276-287, June 18-21, 1997, Las Vegas, Nevada, United States
|
 |
7
|
|
| |
8
|
|
 |
9
|
|
| |
10
|
Ruoming Jin and Gagan Agrawal. A middleware for developing parallel data mining implementations. In Proceedings of the first SIAM conference on Data Mining, April 2001.
|
| |
11
|
Ruoming Jin and Gagan Agrawal. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance. In Proceedings of the second SIAM conference on Data Mining, April 2002.
|
 |
12
|
|
| |
13
|
|
| |
14
|
Larry W. McVoy and Carl Staelin. lmbench: Portable tools for performance analysis. In USENIX Annual Technical Conference, pages 279-294, 1996.
|
 |
15
|
|
| |
16
|
|
 |
17
|
|
 |
18
|
Daniel J. Sorin , Vijay S. Pai , Sarita V. Adve , Mary K. Vernon , David A. Wood, Analytic evaluation of shared-memory systems with ILP processors, Proceedings of the 25th annual international symposium on Computer architecture, p.380-391, June 27-July 02, 1998, Barcelona, Spain
|
 |
19
|
|
 |
20
|
|
 |
21
|
M. K. Vernon , E. D. Lazowska , J. Zahorjan, An accurate and efficient performance analysis technique for multiprocessor snooping cache-consistency protocols, Proceedings of the 15th Annual International Symposium on Computer architecture, p.308-315, May 30-June 02, 1988, Honolulu, Hawaii, United States
|
| |
22
|
|
|