ACM Home Page
Please provide us with feedback. Feedback
Efficient bulk insertion into a distributed ordered table
Full text PdfPdf (525 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data table of contents
Vancouver, Canada
SESSION: Research Session 16: Transactions and Distribution table of contents
Pages 765-778  
Year of Publication: 2008
ISBN:978-1-60558-102-6
Authors
Adam Silberstein  Yahoo! Research, Santa Clara, CA, USA
Brian F. Cooper  Yahoo! Research, Santa Clara, CA, USA
Utkarsh Srivastava  Yahoo! Research, Santa Clara, CA, USA
Erik Vee  Yahoo! Research, Santa Clara, CA, USA
Ramana Yerneni  Yahoo! Research, Santa Clara, CA, USA
Raghu Ramakrishnan  Yahoo! Research, Santa Clara, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 27,   Downloads (12 Months): 275,   Citation Count: 2
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376616.1376693
What is a DOI?

ABSTRACT

We study the problem of bulk-inserting records into tables in a system that horizontally range-partitions data over a large cluster of shared-nothing machines. Each table partition contains a contiguous portion of the table's key range, and must accept all records inserted into that range. Examples of such systems include BigTable[8] at Google, and PNUTS [15] at Yahoo! During bulk inserts into an existing table, if most of the inserted records end up going into a small number of data partitions, the obtained throughput may be very poor due to ineffective use of cluster parallelism. We propose a novel approach in which a planning phase is invoked before the actual insertions. By creating new partitions and intelligently distributing partitions across machines, the planning phase ensures that the insertion load will be well-balanced. Since there is a tradeoff between the cost of moving partitions and the resulting throughput gain, the planning phase must minimize the sum of partition movement time and insertion time. We show that this problem is a variation of NP-hard bin-packing, reduce it to a problem of packing vectors, and then give a solution with provable approximation guarantees. We evaluate our approach on a prototype system deployed on a cluster of 50 machines, and show that it yields significant improvements over more naïve techniques.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
GridFTP. http://www.globus.org/grid_software/data/gridftp.php.
 
2
Hacmp for system p servers. http://www-03.ibm.com/systems/p/software/hacmp/index.html.
 
3
Oracle real application clusters 11g. http://www.oracle.com/technology/products/database/clustering/index.html.
 
4
Scalability and performance with oracle 11g database. http://www.oracle.com/technology/deploy/performance/pdf/db_perfscale_11gr1_twp.pdf, 2007.
5
 
6
P. Bernstein, N. Dani, B. Khessib, R. Manne, and D. Shutt. Data management issues in supporting large-scale web services. IEEE Data Engineering Bulletin, December 2006.
 
7
L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: Evidence and implications. In Proc. INFOCOM, 1999.
 
8
 
9
 
10
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Technical report, Yahoo! Research, 2008.
 
11
12
 
13
14
15
 
16
S. Hung and J. Fisk. An algorithm for 0-1 multiple knapsack problems. In Naval Res. Logist. Quart., pages 571--579, 1978.
 
17
 
18
R. Loulou and E. Michaelides. New greedy-like heuristics for the multidimensional 0-1 knapsack problem. Operations Research, 27:1101--1114, 1979.
 
19
 
20
S. Mishra. Loading bulk data into a partitioned table. http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx, September 2006.
21
 
22
 
23
S. Senju and Y. Toyoda. An approach to linear programming with 0-1 variables. Management Science, 15(4):B196--B207, 1968.
 
24
 
25
 
26
M. Vasquez and J. Hao. A hybrid approach for the 01 multidimensional knapsack problem. In IJCAI, pages 328--333, 2001.
 
27
 
28


Collaborative Colleagues:
Adam Silberstein: colleagues
Brian F. Cooper: colleagues
Utkarsh Srivastava: colleagues
Erik Vee: colleagues
Ramana Yerneni: colleagues
Raghu Ramakrishnan: colleagues