Downloading textual hidden web content through keyword queries
Source: International Conference on Digital Libraries
Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries
Denver, CO, USA
Session: Tools & techniques track: searching and IR
Pages: 100-109
Year of Publication: 2005
ISBN: 1-58113-876-8
Authors
Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1065385.1065407

ABSTRACT

An ever-increasing amount of information on the Web today is available only through search interfaces: users have to type a set of keywords into a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to Hidden Web pages, search engines cannot discover and index them and thus do not return them in their results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users.

In this paper, we study how to build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on four real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.
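
The iterative scheme described in the abstract (issue one query, download the results, choose the next query, repeat) might be sketched roughly as follows. This is an illustrative toy only, not the paper's method: the ToySite class, the crawl function, and the "most frequent unused term" selection rule are all assumptions made for the example, and the abstract does not say which policies the authors actually propose.

```python
from collections import Counter


class ToySite:
    """Hypothetical in-memory stand-in for a site reachable only by keyword search."""

    def __init__(self, docs):
        self.docs = docs  # maps doc_id -> document text

    def search(self, keyword):
        """Return the ids of documents that contain the keyword as a whole word."""
        return {doc_id for doc_id, text in self.docs.items()
                if keyword in text.lower().split()}


def crawl(site, seed, max_queries):
    """Issue a different keyword in every iteration until the budget runs out.

    Returns the downloaded documents and the list of keywords issued, in order.
    """
    downloaded = {}  # doc_id -> text retrieved so far
    issued = []      # keywords already used
    keyword = seed
    while keyword is not None and len(issued) < max_queries:
        issued.append(keyword)
        for doc_id in site.search(keyword):
            downloaded[doc_id] = site.docs[doc_id]

        # Assumed placeholder policy: take the unused term that appears in the
        # most downloaded documents, breaking ties alphabetically.
        counts = Counter()
        for text in downloaded.values():
            counts.update(set(text.lower().split()))
        candidates = [(-n, word) for word, n in counts.items()
                      if word not in issued]
        keyword = min(candidates)[1] if candidates else None
    return downloaded, issued
```

Calling crawl(site, "web", 3) on a small ToySite returns the documents reached within three queries. Note that a document sharing no terms with anything already downloaded can never be reached by a policy that draws keywords only from downloaded text, which is one reason the choice of query policy matters.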




Collaborative Colleagues:
Alexandros Ntoulas: colleagues
Petros Zerfos: colleagues
Junghoo Cho: colleagues