ABSTRACT
An ever-increasing amount of information on the Web today is available only through search interfaces: users must type a set of keywords into a search form in order to access pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to Hidden Web pages, search engines cannot discover and index them and thus do not return them in their results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users.

In this paper, we study how to build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler faces is how to automatically generate meaningful queries to issue to the site. We provide a theoretical framework for investigating the query generation problem for the Hidden Web, and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on four real Hidden Web sites, and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (containing 14 million documents) after issuing fewer than 100 queries.
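The iterative scheme described above (one new keyword query per iteration, chosen automatically) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `SITE` collection, the `search` function standing in for a query interface, and the selection rule (pick the most frequent not-yet-issued word seen in the pages downloaded so far). It is not necessarily one of the policies proposed in the paper.

```python
from collections import Counter

# Toy stand-in for a Hidden Web site: document id -> text (hypothetical data).
SITE = {
    1: "deep web crawler query",
    2: "query interface search form",
    3: "search engine index",
    4: "index pages crawler",
    5: "library catalog search",
}


def search(keyword):
    """Simulated query interface: ids of documents containing the keyword."""
    return {doc_id for doc_id, text in SITE.items() if keyword in text.split()}


def crawl(seed, max_queries):
    """Issue one new keyword query per iteration until the budget is spent.

    The next keyword is the most frequent unissued word among the pages
    downloaded so far (an assumed heuristic; ties are broken alphabetically).
    """
    downloaded = set()
    issued = []
    keyword = seed
    while keyword is not None and len(issued) < max_queries:
        issued.append(keyword)
        downloaded |= search(keyword)
        # Count candidate words in everything downloaded so far.
        counts = Counter(
            word
            for doc_id in downloaded
            for word in SITE[doc_id].split()
            if word not in issued
        )
        keyword = max(sorted(counts), key=counts.get) if counts else None
    return downloaded, issued


downloaded, issued = crawl(seed="crawler", max_queries=3)
coverage = len(downloaded) / len(SITE)  # fraction of the toy site retrieved
```

In this sketch, `coverage` plays the role of the fraction of the site downloaded for a given query budget, which is the kind of quantity the abstract reports (more than 90% after fewer than 100 queries in one experiment).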