ACM Home Page
Please provide us with feedback. Feedback
Distributed aggregation for data-parallel computing: interfaces and implementations
Full text PdfPdf (571 KB)
Source
ACM Symposium on Operating Systems Principles archive
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles table of contents
Big Sky, Montana, USA
SESSION: Clusters table of contents
Pages 247-260  
Year of Publication: 2009
ISBN:978-1-60558-752-3
Authors
Yuan Yu  Microsoft Research, Mountain View, CA, USA
Pradeep Kumar Gunda  Microsoft Research, Mountain View, CA, USA
Michael Isard  Microsoft Research, Mountain View, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGOPS: ACM Special Interest Group on Operating Systems
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 68,   Downloads (12 Months): 68,   Citation Count: 0
Additional Information:

abstract   references   index terms  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1629575.1629600
What is a DOI?

ABSTRACT

Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations, and the efficiency of their implementation, is of great current interest.

This paper evaluates the interfaces and implementations for user-defined aggregation in several state of the art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; the choice of programming interface has a material effect on the performance of computations; some execution plans perform better than others on average; and that in order to get good performance on a variety of workloads a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
The HIVE project. http://hadoop.apache.org/hive/.
 
2
Database Languages--SQL, ISO/IEC 9075-*:2003, 2003.
 
3
Hadoop wiki. http://wiki.apache.org/hadoop/, April 2008.
 
4
C. Baru and G. Fecteau. An overview of DB2 parallel edition. In International Conference on Management of Data (SIGMOD), pages 460--462, New York, NY, USA, 1995. ACM Press.
 
5
G.E. Blelloch. Programming parallel algorithms. Communications of the ACM (CACM), 39(3):85--97, 1996.
 
6
R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. In International Conference of Very Large Data Bases (VLDB), August 2008.
 
7
S. Chaudhuri. An overview of query optimization in relational systems. In PODS '98: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 34--43, 1998.
 
8
M. Cole. Algorithmic skeletons: structured management of parallel computation. MIT Press, Cambridge, MA, USA, 1991.
 
9
T. Cruanes, B. Dageville, and B. Ghosh. Parallel SQL execution in Oracle 10g. In International Conference on Management of Data (SIGMOD), pages 850--854, Paris, France, 2004. ACM.
 
10
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), pages 137--150, Dec. 2004.
 
11
D. DeWitt, S. Ghandeharizadeh, D. Schneider, H. Hsiao, A. Bricker, and R. Rasmussen. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, 2(1), 1990.
 
12
D. DeWitt and J. Gray. Parallel database systems: The future of high performance database processing. Communications of the ACM, 36(6), 1992.
 
13
A. Eisenberg, J. Melton, K. Kulkarni, J.-E. Michels, and F. Zemke. Sql:2003 has been published. SIGMOD Rec., 33(1):119--126, 2004.
 
14
G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In International Conference on Management of data (SIGMOD), pages 102--111, New York, NY, USA, 1990. ACM Press.
 
15
G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73--169, 1993.
 
16
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1), 1997.
 
17
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of European Conference on Computer Systems (EuroSys), pages 59--72, March 2007.
 
18
M. Isard and Y. Yu. Distributed data-parallel computing using a high-level programming language. In International Conference on Management of Data (SIGMOD), June 29-July 2 2009.
 
19
R. Lämmel. Google's mapreduce programming model -- revisited. Science of Computer Programming, 70(1):1--30, 2008.
 
20
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In International Conference on Management of Data (Industrial Track) (SIGMOD), Vancouver, Canada, June 2008.
 
21
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277--298, 2005.
 
22
F. Rabhi and S. Gorlatch. Patterns and Skeletons for Parallel and Distributed Computing. Springer, 2003.
 
23
C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13--24, 2007.
 
24
L.A. Rowe and M.R. Stonebraker. The postgres data model. In International Conference of Very Large Data Bases (VLDB), pages 83--96. Society Press, 1987.
 
25
J. Russell. Oracle9i Application Developer's Guide--Fundamentals. Oracle Corporation, 2002.
 
26
P. Trinder, H.-W. Loidl, and R. Pointon. Parallel and distributed Haskells. Journal of Functional Programming, 12((4&5)):469--510, 2002.
 
27
Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P.K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), December 8-10 2008.
 
28
Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P.K. Gunda, J. Currey, F. McSherry, and K. Achan. Some sample programs written in DryadLINQ. Technical Report MSR-TR-2008-74, Microsoft Research, May 2008.