ACM Home Page
Please provide us with feedback. Feedback
An effective algorithm for parallelizing sort merge joins in the presence of data skew
Full text PdfPdf (1.29 MB)
Source International Symposium on Databases for Parallel and Distributed Systems archive
Proceedings of the second international symposium on Databases in parallel and distributed systems table of contents
Dublin, Ireland
Pages: 103 - 115  
Year of Publication: 1990
ISBN:0-8186-2052-8
Authors
Joel L. Wolf  P.O. Box 704, Yorktown Heights, N.Y. 10598, IBM Research Division, T. J. Watson Research Center
Daniel M. Dias  P.O. Box 704, Yorktown Heights, N.Y. 10598, IBM Research Division, T. J. Watson Research Center
Philip S. Yu  P.O. Box 704, Yorktown Heights, N.Y. 10598, IBM Research Division, T. J. Watson Research Center
Sponsors
IEEE-CS\TCDE : TC on Data Engineering
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 2,   Downloads (12 Months): 24,   Citation Count: 19
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/319057.319072
What is a DOI?

ABSTRACT

Parallel processing of relational queries has received considerable attention of late. However, in the presence of data skew, the speedup from conventional parallel join algorithms can be very limited, due to load imbalances among the various processors. Even a single large skew element can cause a processor to become overloaded. In this paper, we propose a parallel sort merge join algorithm which uses a divide-and-conquer approach to address the data skew problem. The proposed algorithm adds an extra scheduling phase to the usual sort, transfer and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements, and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution for data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment, and is shown to be very robust relative to the degree of data skew and the total number of processors.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
AHO74
 
AKL87
 
BIC85
Bic L. and Hartman R. L. 11985) "Hither Hundreds of Processors in a Database Machine," Proceedings of the 1985 International Workshop on Database Machines, Springer Verlag.
 
BLAS77
Blasgen M. and Eswaran K. (1977) "Storage and Access in Relational Databases," IBM Sysfemr Journal , Vol. 4, p. 363.
 
BLUM72
Blum, M., Floyd, R.W., Pratt, V.R., Rivest, R.L. and Tarjan, R.E. (1972) "Time Bounds for Selection,", Journal of Computer and System Sciences, Vol. 7,448 - 461.
 
BRAT84
 
CHRI83
Christodoulakis S. (1983) "Estimating Record Selectivities,"Inrrmafion .Sysfem.r, Vol. 8, No. 2, 105 - 115.
 
COFF73
 
COFF78
Coffman E., Carey M. and Johnson, D.S. (1978) "An ADDhcation of Bin Pa&m? to Multiorocessor Schedr&g," SIAM Journal of &nputing, Vol. 7, l-17.
 
CORN86
 
DEMU85
Demurjian S.A., Hsiao D.K., Kerr D.S., Menon J., Strawscr P.R., Tekampe R.C., Trimble J. and Watson R.J. (1985) "Performance Evaluation of a Database System in Multiple Backend Configurations," Proceedings of the 1985 international Workshop on Database Machines, Springer Verlag.
 
DEWI85
Dewitt, D.J. and Gerber R.H. (1985) "Multiorocessor Hash-based join algorithms," Pioceebings of* the 11th International Conference on Verv Lame Databases.
 
DEWI86
 
DEWI87
Dewitt, D.J., Smith M. and Boral H. (1987) "A Single-User Performance Evaluation of the Teradata Database Machine," h4 CC Technical Report 88-081-87.
 
FRED82
Frederickson G. and Johnson D.B. (1982) +The Complexity of Selection and Ranking in X + Y and Matrices with Sorted Columns," Journal of Computer and System Sciences, Vol. 24, 197-209.
 
FRED84
GALI79
 
GRAH69
Graham R. (1969) "Bounds on Multiprocessing Timing Anomalies," SIAM Journal of Appl. Math., Vol. 17, No. 2, 416-429.
 
HSIA83
 
IBAR88
Ibamki, T. and Katoh, N. (1988) Resource Allocation Problems. MIT Press.
 
IYER88
Iyer, B.R., and Dias, D.M. (1988) "System Issues in Parallel Sorling for Database Systems," IBM Research Reoort RJ 6585.
 
IYER89
 
KITS83
Kitsuregawa, M., Tanaka, II. and Motooka, T., Applicafion of Hash to Data Bare Machine and its Archileclure New Generation Computing, Vol. 1, No. 1, 1983.
 
KNUT73
 
LAKS88
 
LAKS89a
 
LAKS89b
Lakshmi S. and Yu P.S. (1989) "Analysis of Parallel Processing Architectures for Database System," Proceedings of the I989 Intf. Con/: on Parallel Processing.
 
LYNC88
 
MONT83
Montgomery A.Y., D'Souza D.J. and Lee S.B. (1983) The Cost of Relational Algebraic Operations on Skewed Data: Estimates and Experiments," Informalion Processing 83, IFIP.
 
NECH84
 
OZKA86
 
QADA85
Qadah, G.Z. (1985) "The Equi-Join Operation on a Multiprocessor Database Machine: Algorithms and the Evaluation of their Performance," Proceedings of the 1985 Inlemational Workshoo on Database Machines, Springer Verlag, 35 - 67.
 
SALZ83
Salza S., Tenanova M. and Velardi P. (1983) "Performance Modelinn of the DBMAC Architecture," Proceedings of the- 1983 International Workshop & Database Machines, Springer-Verlag, 74 - 90.
SCHN89
 
STON86
Stonebraker. M. (1986) "The Case for Shared Nothing," tEEE batab&e E&ineering, Vol. 9, No. I.
TANT88
VALD84
 
WOLF90
Wolf, J.L., Dias, D.M., and Yu, P.S., "An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew", IBM Research Report RC 15510, 1990.
 
YU87
Yu, P.S., Dias, D.M., Robinson, J.T., lyer, B.R., and Cornell, D.W. (1987) "On Coupling Multi-Systems Through Data Sharing", Proceedings of the IEEE, Vol. 75, No. 5, 573-587.
 
ZIPF49
Zipf G.K. (1949) Human Behavior and the Principle of Least Efforts Addison-Wesley.

CITED BY  19

Collaborative Colleagues:
Joel L. Wolf: colleagues
Daniel M. Dias: colleagues
Philip S. Yu: colleagues