Structure-driven crawler generation by example
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
SESSION: Web 2
Pages: 292 - 299  
Year of Publication: 2006
ISBN:1-59593-369-7
Authors
Márcio L. A. Vidal  Universidade Federal do Amazonas, Manaus -- Brazil
Altigran S. da Silva  Universidade Federal do Amazonas, Manaus -- Brazil
Edleno S. de Moura  Universidade Federal do Amazonas, Manaus -- Brazil
João M. B. Cavalcanti  Universidade Federal do Amazonas, Manaus -- Brazil
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1148170.1148223

ABSTRACT

Many Web IR and Digital Library applications require a crawling process to collect pages, with the ultimate goal of exploiting the useful information available on Web sites. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides better criteria to guide the crawling process than their content. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site and to generate a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to the sample page. In the experiments we have carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the samples given, including those pages added after their generation.
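
The idea of a navigation pattern can be illustrated with a small sketch. The abstract does not specify how link patterns are represented or how structural similarity is measured, so everything below is an assumption made for illustration: link patterns are modelled as URL regular expressions, similarity is approximated by comparing the sequences of opening tags with Python's `difflib`, and the site is a hypothetical in-memory dictionary (`SITE`) rather than a live Web site. It is not the authors' algorithm.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the sequence of opening tags and the href of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse(html):
    """Return (tag sequence, outgoing links) for an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    return collector.tags, collector.links


def structural_similarity(html_a, html_b):
    """Assumed stand-in measure: similarity ratio of the two tag sequences."""
    tags_a, _ = parse(html_a)
    tags_b, _ = parse(html_b)
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def crawl(site, entry_point, navigation_pattern, sample_html, threshold=0.8):
    """Follow one link pattern per level, then keep pages similar to the sample."""
    frontier = [entry_point]
    for link_pattern in navigation_pattern:
        regex = re.compile(link_pattern)
        next_frontier = []
        for url in frontier:
            _, links = parse(site[url])
            for link in links:
                if regex.fullmatch(link) and link in site and link not in next_frontier:
                    next_frontier.append(link)
        frontier = next_frontier
    return [
        url for url in frontier
        if structural_similarity(site[url], sample_html) >= threshold
    ]


# Hypothetical toy site: URL -> HTML source.
SITE = {
    "/": '<html><body><a href="/authors">Authors</a>'
         '<a href="/about">About</a></body></html>',
    "/authors": '<html><body><ul>'
                '<li><a href="/author/1">A</a></li>'
                '<li><a href="/author/2">B</a></li>'
                '<li><a href="/author/3">C</a></li>'
                '</ul></body></html>',
    "/author/1": '<html><body><h1>A</h1><table><tr><td>x</td></tr></table></body></html>',
    "/author/2": '<html><body><h1>B</h1><table><tr><td>y</td></tr>'
                 '<tr><td>z</td></tr></table></body></html>',
    "/author/3": '<html><body><p>Page removed</p></body></html>',
    "/about": '<html><body><p>About</p></body></html>',
}

# Assumed navigation pattern: entry point -> author index -> author pages.
PATTERN = [r"/authors", r"/author/\d+"]

if __name__ == "__main__":
    print(crawl(SITE, "/", PATTERN, SITE["/author/1"]))
```

Run as a script, this prints `['/author/1', '/author/2']`: the `/about` link is never followed because it does not match the first link pattern, and `/author/3` matches the URL pattern but is discarded because its tag structure differs from the sample page. Because the pattern describes links rather than a fixed list of URLs, a new author page added to the index later would be picked up by the same crawl, which is the property the abstract reports for pages added after crawler generation.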



