Pig Latin: a not-so-foreign language for data processing
Source
International Conference on Management of Data
Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data
Vancouver, Canada
SESSION: Industrial Session 2: Database Programming and Performance
Pages 1099-1110
Year of Publication: 2008
ISBN: 978-1-60558-102-6
Authors
Christopher Olston, Yahoo! Research, Santa Clara, CA, USA
Benjamin Reed, Yahoo! Research, Santa Clara, CA, USA
Utkarsh Srivastava, Yahoo! Research, Santa Clara, CA, USA
Ravi Kumar, Yahoo! Research, Santa Clara, CA, USA
Andrew Tomkins, Yahoo! Research, Santa Clara, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1376616.1376726

ABSTRACT

There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style unnatural. The success of the more procedural map-reduce programming model, and of its associated scalable implementations on commodity hardware, is evidence of both points. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse.

We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig and can lead to even higher productivity gains. Pig is an open-source Apache-incubator project and is available for general use.
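
To illustrate the "sweet spot" the abstract describes, here is a minimal sketch in plain Python (not Pig Latin, and not taken from the paper) of a step-by-step dataflow: each step is a named, high-level operation such as filter or group, rather than one declarative query or hand-written map and reduce functions. The record layout, the helper names filter_by and group_by, the thresholds, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input records: (url, category, pagerank).
urls = [
    ("a.example", "news", 0.9),
    ("b.example", "news", 0.4),
    ("c.example", "sports", 0.1),
    ("d.example", "sports", 0.7),
    ("e.example", "news", 0.05),
]


def filter_by(records, predicate):
    """Keep only the records for which predicate(record) is true."""
    return [r for r in records if predicate(r)]


def group_by(records, key):
    """Collect records into a dict mapping key(record) to a list of records."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)


# Step 1: keep URLs whose pagerank exceeds an (assumed) threshold.
good_urls = filter_by(urls, lambda r: r[2] > 0.2)

# Step 2: group the surviving URLs by category.
groups = group_by(good_urls, lambda r: r[1])

# Step 3: keep only categories with at least two surviving URLs.
big_groups = {cat: rs for cat, rs in groups.items() if len(rs) >= 2}

# Step 4: compute the average pagerank per remaining category.
result = {cat: mean(r[2] for r in rs) for cat, rs in big_groups.items()}

print(result)
```

Each intermediate result has a name and can be inspected on its own, which is the procedural flavor the abstract contrasts with SQL; a system such as Pig would additionally compile steps like these into map-reduce jobs, which this single-process sketch does not attempt.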


