|
ABSTRACT
Parallel processing of relational queries has received considerable attention of late. However, in the presence of data skew, the speedup from conventional parallel join algorithms can be very limited, due to load imbalances among the various processors. Even a single large skew element can cause a processor to become overloaded. In this paper, we propose a parallel sort merge join algorithm which uses a divide-and-conquer approach to address the data skew problem. The proposed algorithm adds an extra scheduling phase to the usual sort, transfer and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements, and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution for data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment, and is shown to be very robust relative to the degree of data skew and the total number of processors.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
AHO74
|
|
| |
AKL87
|
|
| |
BIC85
|
Bic L. and Hartman R. L. 11985) "Hither Hundreds of Processors in a Database Machine," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag.
|
| |
BLAS77
|
Blasgen M. and Eswaran K. (1977) "Storage and Access in Relational Databases," IBM Sysfemr Journal , Vol. 4, p. 363.
|
| |
BLUM72
|
Blum, M., Floyd, R.W., Pratt, V.R., Rivest, R.L. and Tarjan, R.E. (1972) "Time Bounds for Selection,", Journal of Computer and System Sciences, Vol. 7,448 - 461.
|
| |
BRAT84
|
|
| |
CHRI83
|
Christodoulakis S. (1983) "Estimating Record Selectivities,"Inrrmafion .Sysfem.r, Vol. 8, No. 2, 105 - 115.
|
| |
COFF73
|
|
| |
COFF78
|
Coffman E., Carey M. and Johnson, D.S. (1978) "An ADDhcation of Bin Pa&m? to Multiorocessor Schedr&g," SIAM Journal of &nputing, Vol. 7, l-17.
|
| |
CORN86
|
|
| |
DEMU85
|
Demurjian S.A., Hsiao D.K., Kerr D.S., Menon J., Strawscr P.R., Tekampe R.C., Trimble J. and Watson R.J. (1985) "Performance Evaluation of a Database System in Multiple Backend Configurations," Proceedings of the 1985 international Workshop on Database Machines, Springer Verlag.
|
| |
DEWI85
|
Dewitt, D.J. and Gerber R.H. (1985) "Multiorocessor Hash-based join algorithms," Pioceebings of* the 11th International Conference on Verv Lame Databases.
|
| |
DEWI86
|
David J. DeWitt , Robert H. Gerber , Goetz Graefe , Michael L. Heytens , Krishna B. Kumar , M. Muralikrishna, GAMMA - A High Performance Dataflow Database Machine, Proceedings of the 12th International Conference on Very Large Data Bases, p.228-237, August 25-28, 1986
|
| |
DEWI87
|
Dewitt, D.J., Smith M. and Boral H. (1987) "A Single-User Performance Evaluation of the Teradata Database Machine," h4 CC Technical Report 88-081-87.
|
| |
FRED82
|
Frederickson G. and Johnson D.B. (1982) +The Complexity of Selection and Ranking in X + Y and Matrices with Sorted Columns," Journal of Computer and System Sciences, Vol. 24, 197-209.
|
| |
FRED84
|
|
 |
GALI79
|
|
| |
GRAH69
|
Graham R. (1969) "Bounds on Multiprocessing Timing Anomalies," SIAM Journal of Appl. Math., Vol. 17, No. 2, 416-429.
|
| |
HSIA83
|
|
| |
IBAR88
|
Ibamki, T. and Katoh, N. (1988) Resource Allocation Problems. MIT Press.
|
| |
IYER88
|
Iyer, B.R., and Dias, D.M. (1988) "System Issues in Parallel Sorling for Database Systems," IBM Research Reoort RJ 6585.
|
| |
IYER89
|
|
| |
KITS83
|
Kitsuregawa, M., Tanaka, II. and Motooka, T., Applicafion of Hash to Data Bare Machine and its Archileclure New Generation Computing, Vol. 1, No. 1, 1983.
|
| |
KNUT73
|
|
| |
LAKS88
|
|
| |
LAKS89a
|
|
| |
LAKS89b
|
Lakshmi S. and Yu P.S. (1989) "Analysis of Parallel Processing Architectures for Database System," Proceedings of the I989 Intf. Con/: on Parallel Processing.
|
| |
LYNC88
|
|
| |
MONT83
|
Montgomery A.Y., D'Souza D.J. and Lee S.B. (1983) The Cost of Relational Algebraic Operations on Skewed Data: Estimates and Experiments," Informalion Processing 83, IFIP.
|
| |
NECH84
|
|
| |
OZKA86
|
|
| |
QADA85
|
Qadah, G.Z. (1985) "The Equi-Join Operation on a Multiprocessor Database Machine: Algorithms and the Evaluation of their Performance," Proceedings of the 1985 Inlemational Workshoo on Database Machines, Springer Verlag, 35 - 67.
|
| |
SALZ83
|
Salza S., Tenanova M. and Velardi P. (1983) "Performance Modelinn of the DBMAC Architecture," Proceedings of the- 1983 International Workshop & Database Machines, Springer-Verlag, 74 - 90.
|
 |
SCHN89
|
|
| |
STON86
|
Stonebraker. M. (1986) "The Case for Shared Nothing," tEEE batab&e E&ineering, Vol. 9, No. I.
|
 |
TANT88
|
A. N. Tantawi , G. Towsley , J. Wolf, Optimal allocation of multiple class resources in computer systems, Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems, p.253-260, May 24-27, 1988, Santa Fe, New Mexico, United States
|
 |
VALD84
|
|
| |
WOLF90
|
Wolf, J.L., Dias, D.M., and Yu, P.S., "An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew", IBM Research Report RC 15510, 1990.
|
| |
YU87
|
Yu, P.S., Dias, D.M., Robinson, J.T., lyer, B.R., and Cornell, D.W. (1987) "On Coupling Multi-Systems Through Data Sharing", Proceedings of the IEEE, Vol. 75, No. 5, 573-587.
|
| |
ZIPF49
|
Zipf G.K. (1949) Human Behavior and the Principle of Least Efforts Addison-Wesley.
|
CITED BY 19
|
|
|
|
|
John Turek , Joel L. Wolf , Philip S. Yu, Approximate algorithms scheduling parallelizable tasks, Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures, p.323-332, June 29-July 01, 1992, San Diego, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|