A comparison of approaches to large-scale data analysis
Source
International Conference on Management of Data archive
Proceedings of the 35th SIGMOD international conference on Management of data table of contents
Providence, Rhode Island, USA
SESSION: Research session 5: large-scale data analysis table of contents
Pages 165-178  
Year of Publication: 2009
ISBN: 978-1-60558-551-2
Authors
Andrew Pavlo  Brown University, Providence, RI, USA
Erik Paulson  University of Wisconsin, Madison, WI, USA
Alexander Rasin  Brown University, Providence, RI, USA
Daniel J. Abadi  Yale University, New Haven, CT, USA
David J. DeWitt  Microsoft Inc., Madison, WI, USA
Samuel Madden  Massachusetts Institute of Technology, Cambridge, MA, USA
Michael Stonebraker  Massachusetts Institute of Technology, Cambridge, MA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1559845.1559865

ABSTRACT

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
[1] Hadoop. http://hadoop.apache.org/.
[2] Hive. http://hadoop.apache.org/hive/.
[3] Vertica. http://www.vertica.com/.
[4] Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical report, 1998.
[5]
[6] Cisco Systems. Cisco Catalyst 3750-E Series Switches Data Sheet, June 2008.
[7] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. Under Submission, March 2009.
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18] R. Rustin, editor. ACM-SIGMOD Workshop on Data Description, Access and Control, May 1974.
[19] M. Stonebraker. The Case for Shared Nothing. Database Engineering, 9:4-9, 1986.
[20] M. Stonebraker and J. Hellerstein. What Goes Around Comes Around. In Readings in Database Systems, pages 2-41. The MIT Press, 4th edition, 2005.
[21]
