The impact of term selection in genre-aware focused crawling
Symposium on Applied Computing
Proceedings of the 2008 ACM symposium on Applied computing
Fortaleza, Ceara, Brazil
SESSION: Information access and retrieval
Pages 1158-1163
Year of Publication: 2008
ISBN:978-1-59593-753-7
ABSTRACT
Genre-aware focused crawling aims to collect pages on specific topics that can be expressed in terms of both genre and content information. The approach requires an expert to specify a set of terms that describe the genre and the content of the pages of interest. In this paper, we analyze the impact of term selection on this approach. We performed an experimental study in which we varied the number of genre and content terms used in focused crawling processes targeting two kinds of pages: syllabi (genre) of computer science courses (subject) and sale offers (genre) of computer equipment (subject). The study showed that a small set of terms selected by an expert is usually enough to produce good results. In addition, we propose and experimentally evaluate a strategy for semi-automatic generation of the terms used in this approach. The results showed that the strategy is very effective and can assist an expert in specifying the required sets of terms.
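To make the idea of separate genre and content term sets concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a page is scored by a weighted combination of two cosine similarities, one against the genre terms and one against the content terms. The function names, the `genre_weight` parameter, and the example terms are all hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase bag-of-words counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def page_score(page_text, genre_terms, content_terms, genre_weight=0.5):
    """Hypothetical relevance score: weighted mix of genre and content similarity.

    The linear combination and the default weight are assumptions made
    for illustration only.
    """
    page = term_vector(page_text)
    genre_sim = cosine(page, Counter(genre_terms))
    content_sim = cosine(page, Counter(content_terms))
    return genre_weight * genre_sim + (1 - genre_weight) * content_sim


# Hypothetical single-word term sets for "syllabi of computer science courses".
GENRE_TERMS = ["syllabus", "prerequisites", "grading", "textbook"]
CONTENT_TERMS = ["algorithms", "databases", "programming", "networks"]
```

In a sketch like this, varying the number of entries in `GENRE_TERMS` and `CONTENT_TERMS` is the kind of change whose effect the study measures; a crawler would use the score to decide which pages to keep or which links to follow first.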