Map-reduce-merge: simplified relational data processing on large clusters
Source
International Conference on Management of Data
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Beijing, China
SESSION: Data processing in the large
Pages: 1029 - 1040
Year of Publication: 2007
ISBN: 978-1-59593-686-8
ABSTRACT
Map-Reduce is a programming model that enables easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks, such as data processing jobs for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. Although processing relational data is a common need, this limitation causes difficulty and inefficiency when Map-Reduce is applied to relational operations such as joins. We extend Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by the map and reduce modules. We also demonstrate that this new model can express relational algebra operators and implement several join algorithms.
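The Merge phase described in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the helper names `map_reduce` and `merge_join`, the sample records, and the restriction to unique keys per reduced output are all assumptions made for illustration. Each dataset goes through its own map and reduce pass, which leaves its output sorted by key, and a merge step then joins the two sorted outputs in one linear scan.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Hypothetical stand-in for one Map-Reduce pass, run in a single process.

    mapper(record) yields (key, value) pairs; reducer(key, values) returns
    one reduced value. Output is sorted by key, as the merge step expects.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [(key, reducer(key, groups[key])) for key in sorted(groups)]


def merge_join(left, right):
    """Sort-merge join of two key-sorted lists whose keys are unique."""
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            joined.append((left_key, left_value, right_value))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # no match on the right for this key
        else:
            j += 1  # no match on the left for this key
    return joined


# Illustrative data only: two heterogeneous datasets sharing a department id.
employees = [("alice", 1, 100), ("bob", 2, 50), ("carol", 1, 70)]
departments = [(1, "search"), (2, "ads"), (3, "maps")]

# One Map-Reduce pass per dataset, each keyed by department id.
bonus_by_dept = map_reduce(
    employees, lambda e: [(e[1], e[2])], lambda key, values: sum(values)
)
name_by_dept = map_reduce(
    departments, lambda d: [(d[0], d[1])], lambda key, values: values[0]
)

# The merge step combines the two reduced, key-sorted outputs.
joined = merge_join(bonus_by_dept, name_by_dept)
```

Because both inputs to `merge_join` are already sorted by key, the join needs no further shuffling or sorting, which is the efficiency argument the abstract makes for merging data that map and reduce have already partitioned and sorted.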
CITED BY 9
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, Pig latin: a not-so-foreign language for data processing, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou, SCOPE: easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment, v.1 n.2, August 2008
Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, Tuyong Wang, Mars: a MapReduce framework on graphics processors, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, October 25-29, 2008, Toronto, Ontario, Canada
Fei Xu, Kevin Beyer, Vuk Ercegovac, Peter J. Haas, Eugene J. Shekita, E = MC3: managing uncertain enterprise data in a cluster-computing environment, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
Anastasios Gounaris, Jim Smith, Norman W. Paton, Rizos Sakellariou, Alvaro A. Fernandes, Paul Watson, Adaptive workload allocation in query processing in autonomous heterogeneous environments, Distributed and Parallel Databases, v.25 n.3, p.125-164, June 2009
Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, Eliot Gillum, BotGraph: large scale spamming botnet detection, Proceedings of the 6th USENIX symposium on Networked systems design and implementation, p.321-334, April 22-24, 2009, Boston, Massachusetts
INDEX TERMS
Primary Classification:
D. Software
D.1 PROGRAMMING TECHNIQUES
D.1.3 Concurrent Programming
Subjects: Parallel programming
Additional Classification:
D. Software
D.3 PROGRAMMING LANGUAGES
D.3.3 Language Constructs and Features
Subjects: Frameworks
H. Information Systems
H.2 DATABASE MANAGEMENT
H.2.4 Systems
Subjects: Parallel databases; Relational databases
General Terms: Design, Languages, Management, Performance, Reliability
Keywords: cluster, data processing, distributed, join, map-reduce, map-reduce-merge, parallel, relational, search engine