| Handling data skew in parallel joins in shared-nothing systems |
| Full text |
Pdf
(1.41 MB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
table of contents
Vancouver, Canada
SESSION: Industrial Session 1: Query Optimization and Performance
table of contents
Pages 1043-1052
Year of Publication: 2008
ISBN:978-1-60558-102-6
|
|
Authors
|
|
Yu Xu
|
Teradata, San Diego, CA, USA
|
|
Pekka Kostamaa
|
Teradata, San Diego, CA, USA
|
|
Xin Zhou
|
Teradata, San Diego, CA, USA
|
|
Liang Chen
|
UCSD, San Diego, CA, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 21, Downloads (12 Months): 212, Citation Count: 0
|
|
|
ABSTRACT
Parallel processing continues to be important in large data warehouses. The processing requirements continue to expand in multiple dimensions. These include greater volumes, increasing number of concurrent users, more complex queries, and more applications which define complex logical, semantic, and physical data models. Shared nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take into account data skew. Data skew occurs naturally in many applications. A query processing skewed data not only slows down its response time, but generates hot nodes, which become a bottleneck throttling the overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly speeds up query elapsed time in the presence of data skew. Our experience shows that eliminating system bottlenecks caused by data skew improves the throughput of the whole system which is important in parallel data warehouses that often run high concurrency workloads.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
TPC Benchmark H (decision support) standard specification http://www.tpc.org.
|
| |
2
|
|
| |
3
|
|
| |
4
|
J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143--154, 1979.
|
| |
5
|
Hasanat M. Dewan , Kui W. Mok , Mauricio Hernández , Salvatore J. Stolfo, Predictive dynamic load balancing of parallel hash-joins over heterogeneous processors in the presence of data skew, Proceedings of the third international conference on on Parallel and distributed information systems, p.40-49, October 1994, Autin, Texas, United States
|
 |
6
|
|
| |
7
|
|
| |
8
|
FrankǎOlken and DoronǎRotem. Random sampling from databases: a survey. Statistics and Computing, 5(1):25--42, 1995.
|
| |
9
|
R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416--429, 1969.
|
| |
10
|
|
| |
11
|
|
| |
12
|
E. G. C. Jr., M. R. Garey, and D. S. Johnson. An application of bin-packing to multiprocessor scheduling. SIAM J. Comput., 7(1):1--17, 1978.
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
M. Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9(1):4--9, 1986.
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
X. Zhou and M. E. Orlowska. Handling data skew in parallel hash join computation using two-phase scheduling. In IEEE 1st International Conference on Algorithm and Architecture for Parallel Processing, pages 527--536 vol.2, 1995.
|
|