A comparison of approaches to large-scale data analysis
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 5: large-scale data analysis
Pages 165-178
Year of Publication: 2009
ISBN: 978-1-60558-551-2
Authors
Andrew Pavlo, Brown University, Providence, RI, USA
Erik Paulson, University of Wisconsin, Madison, WI, USA
Alexander Rasin, Brown University, Providence, RI, USA
Daniel J. Abadi, Yale University, New Haven, CT, USA
David J. DeWitt, Microsoft Inc., Madison, WI, USA
Samuel Madden, Massachusetts Institute of Technology, Cambridge, MA, USA
Michael Stonebraker, Massachusetts Institute of Technology, Cambridge, MA, USA
ABSTRACT
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than for the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
REFERENCES
[1] Hadoop. http://hadoop.apache.org/.
[2] Hive. http://hadoop.apache.org/hive/.
[3] Vertica. http://www.vertica.com/.
[4] Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical report, 1998.
[5] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, v.1 n.2, August 2008. doi:10.1145/1454159.1454166
[6] Cisco Systems. Cisco Catalyst 3750-E Series Switches Data Sheet, June 2008.
[7] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. Under Submission, March 2009.
[8]
[9]
[10] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. Proceedings of the 12th International Conference on Very Large Data Bases, p.228-237, August 25-28, 1986.
[11]
[12]
[13] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, March 21-23, 2007, Lisbon, Portugal.
[14] Erik Meijer, Brian Beckman, Gavin Bierman. LINQ: reconciling object, relations and XML in the .NET framework. Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA. doi:10.1145/1142473.1142552
[15] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada. doi:10.1145/1376616.1376726
[16]
[17]
[18] R. Rustin, editor. ACM-SIGMOD Workshop on Data Description, Access and Control, May 1974.
[19] M. Stonebraker. The Case for Shared Nothing. Database Engineering, 9:4-9, 1986.
[20] M. Stonebraker and J. Hellerstein. What Goes Around Comes Around. In Readings in Database Systems, pages 2-41. The MIT Press, 4th edition, 2005.
[21]