Google's Deep Web crawl
Source
Proceedings of the VLDB Endowment archive
Volume 1, Issue 2 (August 2008)
SESSION: Industrial, application, and experience sessions: web
Pages 1241-1252  
Year of Publication: 2008
ISSN:2150-8097
Authors
Jayant Madhavan  Google Inc.
David Ko  Google Inc.
Łucja Kot  Cornell University
Vignesh Ganapathy  Google Inc.
Alex Rasmussen  University of California, San Diego
Alon Halevy  Google Inc.
DOI: http://doi.acm.org/10.1145/1454159.1454163

ABSTRACT

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.
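
As an illustration only, the idea of pre-computing a form submission can be pictured as building the URL a browser would request when a GET form is filled in with one particular assignment of values. The sketch below assumes a GET form; the function name `submission_url`, the example.com address, and the input names are hypothetical and are not taken from the paper.

```python
from typing import Mapping
from urllib.parse import urlencode


def submission_url(action: str, assignment: Mapping[str, str]) -> str:
    """Build the GET URL for one form submission.

    `action` is the form's target URL and `assignment` maps input names
    to the values chosen for them (both hypothetical in this sketch).
    """
    # Sort the parameters so the same assignment always yields the same URL,
    # which makes duplicate submissions easy to detect.
    query = urlencode(sorted(assignment.items()))
    return f"{action}?{query}"


# Hypothetical form with a select menu ("make") and a text box ("q").
url = submission_url("http://example.com/search", {"q": "used cars", "make": "honda"})
print(url)
```

A URL produced this way can be fetched and indexed like any other page, which is the sense in which surfacing turns form content into ordinary crawlable pages.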

Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
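
To make the third challenge concrete, the following minimal sketch shows the naive strategy the abstract warns against: enumerating the full Cartesian product of input values. It is not the paper's algorithm. The form, its input names, and its values are invented for illustration, and a GET form is assumed.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode


def naive_submission_count(inputs):
    """Number of URLs produced by enumerating every combination of values."""
    return prod(len(values) for values in inputs.values())


def enumerate_submissions(action, inputs):
    """Yield one GET URL per element of the Cartesian product of input values."""
    names = sorted(inputs)  # fixed parameter order gives stable URLs
    for combo in product(*(inputs[name] for name in names)):
        yield f"{action}?{urlencode(list(zip(names, combo)))}"


# Hypothetical form with three small select menus.
inputs = {
    "make": ["ford", "honda", "toyota"],
    "state": ["CA", "NY"],
    "max_price": ["5000", "10000"],
}

count = naive_submission_count(inputs)
urls = list(enumerate_submissions("http://example.com/search", inputs))
print(count, urls[0])
```

Because the count is the product of the per-input value counts, it grows multiplicatively with each additional input. Three tiny menus already give 12 URLs, and a form with five inputs of 50 values each would give 50^5 = 312,500,000, which is why enumerating every combination does not scale.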

