ABSTRACT
In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
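The idea of a crawler that learns which links are promising while it crawls can be illustrated with a minimal sketch. This is not the framework proposed in the paper: the toy link graph, the anchor-word scoring rule, and every name below (`TOY_WEB`, `AnchorWordScorer`, `crawl`) are assumptions made purely for illustration. The harvest rate is taken here to mean forms found divided by pages visited.

```python
from collections import defaultdict

# Hypothetical in-memory "web": url -> (page_has_form, [(anchor_text, target_url), ...]).
# Every URL and anchor text is invented for this example.
TOY_WEB = {
    "hub": (False, [("search flights", "a"), ("contact us", "b"),
                    ("search hotels", "c"), ("about", "d")]),
    "a": (True, [("search cars", "e")]),
    "b": (False, [("privacy policy", "f")]),
    "c": (True, []),
    "d": (False, []),
    "e": (True, []),
    "f": (False, []),
}


class AnchorWordScorer:
    """Online estimate of how often links containing a word lead to a form."""

    def __init__(self):
        self.hits = defaultdict(int)   # times a word's link led to a form page
        self.seen = defaultdict(int)   # times a word's link was followed

    def score(self, anchor):
        words = anchor.split()
        if not words:
            return 0.5  # neutral prior for the seed, which has no anchor text
        # Laplace-smoothed success rate per word, averaged over the anchor.
        return sum((self.hits[w] + 1) / (self.seen[w] + 2) for w in words) / len(words)

    def update(self, anchor, found_form):
        for w in anchor.split():
            self.seen[w] += 1
            self.hits[w] += int(found_form)


def crawl(seed, budget, adaptive=True):
    """Visit up to `budget` pages; return (form pages found, harvest rate)."""
    scorer = AnchorWordScorer()
    frontier = {seed: ""}              # url -> anchor text of the link that led to it
    visited, forms = set(), []
    while frontier and len(visited) < budget:
        # Re-score the whole frontier with the current model; ties go to the oldest entry.
        url = max(frontier, key=lambda u: scorer.score(frontier[u]))
        anchor = frontier.pop(url)
        visited.add(url)
        has_form, links = TOY_WEB[url]
        if has_form:
            forms.append(url)
        if adaptive:
            scorer.update(anchor, has_form)   # learn from the outcome of this link
        for text, target in links:
            if target not in visited and target not in frontier:
                frontier[target] = text
    return forms, len(forms) / len(visited)


adaptive_forms, adaptive_rate = crawl("hub", budget=4)
fixed_forms, fixed_rate = crawl("hub", budget=4, adaptive=False)
```

On this toy graph with a budget of four pages, the adaptive run learns that "search" links tend to pay off and reaches three form pages, while the run with learning switched off visits pages in discovery order and reaches two. A real crawler would also use page contents to stay on topic and would follow some links with delayed benefit, as the abstract describes; neither is modeled here.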