Pig Latin: a not-so-foreign language for data processing
Source
International Conference on Management of Data
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Vancouver, Canada
SESSION: Industrial Session 2: Database Programming and Performance
Pages 1099-1110
Year of Publication: 2008
ISBN: 978-1-60558-102-6
Authors
Christopher Olston, Yahoo! Research, Santa Clara, CA, USA
Benjamin Reed, Yahoo! Research, Santa Clara, CA, USA
Utkarsh Srivastava, Yahoo! Research, Santa Clara, CA, USA
Ravi Kumar, Yahoo! Research, Santa Clara, CA, USA
Andrew Tomkins, Yahoo! Research, Santa Clara, CA, USA
Bibliometrics
Downloads (6 Weeks): 61, Downloads (12 Months): 519, Citation Count: 14
ABSTRACT
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.
CITED BY 14
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou, SCOPE: easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment, v.1 n.2, August 2008
David J. DeWitt, Erik Paulson, Eric Robinson, Jeffrey Naughton, Joshua Royalty, Srinath Shankar, Andrew Krioukov, Clustera: an integrated computation and data management system, Proceedings of the VLDB Endowment, v.1 n.1, August 2008
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni, PNUTS: Yahoo!'s hosted data serving platform, Proceedings of the VLDB Endowment, v.1 n.2, August 2008
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, A comparison of approaches to large-scale data analysis, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
Anastasios Gounaris, Jim Smith, Norman W. Paton, Rizos Sakellariou, Alvaro A. Fernandes, Paul Watson, Adaptive workload allocation in query processing in autonomous heterogeneous environments, Distributed and Parallel Databases, v.25 n.3, p.125-164, June 2009
Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, Eliot Gillum, BotGraph: large scale spamming botnet detection, Proceedings of the 6th USENIX symposium on Networked systems design and implementation, p.321-334, April 22-24, 2009, Boston, Massachusetts