Web-scale extraction of structured data
Source
ACM SIGMOD Record archive
Volume 37, Issue 4 (December 2008)
COLUMN: Special section on managing information extraction
Pages 55-61  
Year of Publication: 2009
ISSN:0163-5808
Authors
Michael J. Cafarella  University of Washington
Jayant Madhavan  Google Inc.
Alon Halevy  Google Inc.
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1519103.1519112

ABSTRACT

A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (two of which come from Google Research). The TextRunner system focuses on raw natural language text, the WebTables system focuses on HTML-embedded tables, and the deep-web surfacing system focuses on "hidden" databases. The domain, expressiveness, and accuracy of extracted data can depend strongly on its source extractor; we describe differences in the characteristics of data produced by the three extractors. Finally, we discuss a series of unique data applications (some of which have already been prototyped) that are enabled by aggregating extracted Web information.

