The impact of term selection in genre-aware focused crawling
Source: Symposium on Applied Computing
Proceedings of the 2008 ACM Symposium on Applied Computing
Fortaleza, Ceará, Brazil
Session: Information access and retrieval
Pages: 1158-1163
Year of Publication: 2008
ISBN: 978-1-59593-753-7
Authors
Guilherme T. de Assis, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Alberto H. F. Laender, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Altigran S. da Silva, Federal University of Amazonas, Manaus, AM, Brazil
Marcos André Gonçalves, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Sponsor
SIGAPP: ACM Special Interest Group on Applied Computing
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1363686.1363953

ABSTRACT

The genre-aware approach to focused crawling aims at crawling pages related to specific topics that can be expressed in terms of both genre and content information. Such an approach requires an expert to specify a set of terms that describe the genre and the content of the pages of interest. In this paper, we analyze the impact of term selection on this approach. To this end, we performed an experimental study in which we varied the number of genre and content terms used in focused crawling processes aimed at two kinds of pages: syllabi (genre) of computer science courses (subject) and sale offers (genre) of computer equipment (subject). This experimental study showed that a small set of terms selected by an expert is usually enough to produce good results. In addition, we propose and experimentally evaluate a strategy for semi-automatic generation of the terms used in such an approach. The results of these experiments showed that this strategy is very effective and provides a means to assist an expert in specifying the required sets of terms.


