ACM Home Page
Please provide us with feedback. Feedback
Map-reduce-merge: simplified relational data processing on large clusters
Full text PdfPdf (518 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2007 ACM SIGMOD international conference on Management of data table of contents
Beijing, China
SESSION: Data processing in the large table of contents
Pages: 1029 - 1040  
Year of Publication: 2007
ISBN:978-1-59593-686-8
Authors
Hung-chih Yang  Yahoo!, Sunnyvale, CA
Ali Dasdan  Yahoo!, Sunnyvale, CA
Ruey-Lung Hsiao  UCLA, Los Angeles, CA
D. Stott Parker  UCLA, Los Angeles, CA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 244,   Downloads (12 Months): 1751,   Citation Count: 9
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1247480.1247602
What is a DOI?

ABSTRACT

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning.

However,this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins.

We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Apache. Hadoop. http://lucene.apache.org/hadoop/, 2006.
 
2
A. C. Arpaci-Dusseau et al. High-Performance Sorting on Networks of Workstations. In SIGMOD 1997, pages 243--254, 1997.
 
3
E. A. Brewer. Combining Systems and Databases: A Search Engine Retrospective. In J. M. Hellerstein and M. Stonebraker, editors, Readings in Database Systems, Fourth Edition, Cambridge, MA, 2005. MIT Press.
 
4
F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI, pages 205--218, 2006.
 
5
L. Chu et al. Optimizing Data Aggregation for Cluster-Based Internet Services. In PPOPP, pages 119--130. ACM, 2003.
 
6
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137--150, 2004.
 
7
 
8
D. J. DeWitt and Gerber. R. Multiprocessor Hash-Based Join Algorithms. In VLDB 1985, 1985.
9
 
10
S. Ghemawat, H. Gobioff, and S. T. Leung. The Google file system. In SOSP, pages 29--43, 2003.
 
11
J. Gray. Sort Benchmark. http://research.microsoft.com/barc/SortBenchmark/,2006.
12
 
13
M. Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007.
 
14
R. Lämmel. Google's MapReduce Programming Model - Revisited. Draft; Online since 2 January, 2006; 26 pages, 22 Jan. 2006.
 
15
 
16
Teradata. Teradata. http://www.teradata.com/t/go.aspx, 2006.
 
17
TPC. TPC-H. http://www.tpc.org/tpch/default.asp, 2006.
 
18
Wikipedia. Redundant Array of Inexpensive Nodes. http://en.wikipedia.org/wiki/Redundant Array of Inexpensive Nodes, 2006.

CITED BY  9

Collaborative Colleagues:
Hung-chih Yang: colleagues
Ali Dasdan: colleagues
Ruey-Lung Hsiao: colleagues
D. Stott Parker: colleagues