Structure-driven crawler generation by example
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
Pages: 292 - 299
Year of Publication: 2006
ISBN: 1-59593-369-7
ABSTRACT
Many Web IR and Digital Library applications require a crawling process to collect pages, with the ultimate goal of taking advantage of useful information available on Web sites. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides a better criterion for guiding the crawling process than their content. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site, and to generate a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to the sample page. In the experiments we have carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the samples given, including pages added after the crawlers were generated.
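The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how navigation patterns are generated or how structural similarity is measured. The sketch assumes a hand-written navigation pattern (a list of regular expressions, one per link-following step) and uses a Jaccard overlap of root-to-node tag paths as a stand-in similarity measure. All names here (SITE, PATTERN, SAMPLE, crawl, structural_similarity, the 0.8 threshold) are hypothetical, and the site is an in-memory dictionary rather than a real Web site.

```python
import re
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML page."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
        if tag in self.VOID:  # void elements never get a closing tag
            self.stack.pop()

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


class LinkCollector(HTMLParser):
    """Collects href values of anchor tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self.links.append(href)


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard overlap of tag-path sets (an assumed stand-in measure)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def crawl(site, entry, navigation_pattern, sample_html, threshold=0.8):
    """Follow one link pattern per step, then keep pages resembling the sample."""
    frontier = [entry]
    for link_pattern in navigation_pattern:
        next_frontier = []
        for url in frontier:
            parser = LinkCollector()
            parser.feed(site.get(url, ""))
            for href in parser.links:
                if re.fullmatch(link_pattern, href) and href not in next_frontier:
                    next_frontier.append(href)
        frontier = next_frontier
    return [
        url for url in frontier
        if url in site
        and structural_similarity(site[url], sample_html) >= threshold
    ]


# Hypothetical in-memory site: entry page -> category page -> item pages.
SITE = {
    "/": '<html><body><ul><li><a href="/cat/1">A</a></li>'
         '<li><a href="/about">About</a></li></ul></body></html>',
    "/cat/1": '<html><body><div><a href="/item/10">x</a>'
              '<a href="/item/11">y</a><a href="/help">h</a></div></body></html>',
    "/item/10": '<html><body><h1>T</h1><table><tr><td>p</td></tr></table></body></html>',
    "/item/11": '<html><body><h1>U</h1><table><tr><td>q</td></tr>'
                '<tr><td>r</td></tr></table></body></html>',
    "/about": '<html><body><p>about</p></body></html>',
    "/help": '<html><body><p>help</p></body></html>',
}
SAMPLE = SITE["/item/10"]                # the sample page given by the user
PATTERN = [r"/cat/\d+", r"/item/\d+"]    # assumed, hand-written navigation pattern

print(crawl(SITE, "/", PATTERN, SAMPLE))
```

Because the sketch stores link patterns rather than concrete URLs, a page such as a hypothetical /item/12 that is later linked from /cat/1 would be reached without changing the pattern, which mirrors the abstract's point about pages added after generation.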
CITED BY 6
Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, Lei Zhang, Wei-Ying Ma, Exploring traversal strategy for web forum crawling, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, Lei Zhang, iRobot: an intelligent crawler for web forums, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti, Supporting the automatic construction of entity aware search engines, Proceeding of the 10th ACM workshop on Web information and data management, October 30-30, 2008, Napa Valley, California, USA
Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang, Lei Zhang, Wei-Ying Ma, Incorporating site-level knowledge for incremental crawling of web forums: a list-wise strategy, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France