ABSTRACT
The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages to a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
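To make the third challenge concrete, the following is a minimal sketch of the naive strategy the abstract warns against: enumerating the full Cartesian product of input values for a single form. It is not the paper's algorithm. The form action URL, the input names, and the candidate values are all hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical form: the action URL, input names, and values are
# assumptions made up for this illustration, not taken from the paper.
FORM_ACTION = "http://example.com/search"
FORM_INPUTS = {
    "make": ["honda", "toyota", "ford"],
    "state": ["CA", "NY", "WA", "TX"],
    "max_price": ["5000", "10000"],
}


def count_naive_urls(inputs):
    """Number of URLs produced by the full Cartesian product."""
    return prod(len(values) for values in inputs.values())


def naive_urls(action, inputs):
    """Yield one GET URL per combination of input values."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield action + "?" + urlencode(dict(zip(names, combo)))


urls = list(naive_urls(FORM_ACTION, FORM_INPUTS))
print(count_naive_urls(FORM_INPUTS))  # 3 * 4 * 2 combinations
print(urls[0])
```

Even this small form yields 24 URLs, and the count multiplies with every additional input or candidate value, which is why a selective search over input combinations is needed at the scale of millions of forms.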