Combining link-based and content-based methods for web document classification
Source: Conference on Information and Knowledge Management
Proceedings of the twelfth international conference on Information and knowledge management
New Orleans, LA, USA
Session: Information retrieval session 7: web
Pages: 394-401
Year of Publication: 2003
ISBN: 1-58113-723-0
Authors
Pável Calado  Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Marco Cristo  Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Edleno Moura  Fed. Univ. of Amazonas, Manaus, Brazil
Nivio Ziviani  Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Berthier Ribeiro-Neto  Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Marcos André Gonçalves  Virginia Tech, VA
Sponsors
ACM: Association for Computing Machinery
SIGMIS: ACM Special Interest Group on Management Information Systems
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/956863.956938

ABSTRACT

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurate they are in predicting document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone is able to obtain gains of up to 46 points in F1, when compared to a traditional content-based classifier. The combination with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides an important insight into which measures derived from links are more appropriate for comparing Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
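
The abstract does not name the four link-based measures or spell out how the Bayesian network combines evidence, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes two classic citation-style measures (co-citation and bibliographic coupling, both normalized as a Jaccard ratio), assumes a noisy-OR rule for merging a link-based score with a content-based score, and shows how F1 is computed from classification counts. All function names are hypothetical.

```python
def cocitation(parents_a, parents_b):
    """Assumed measure: share of pages linking to either document
    that link to both (Jaccard ratio over in-link sets)."""
    union = parents_a | parents_b
    return len(parents_a & parents_b) / len(union) if union else 0.0


def bibliographic_coupling(children_a, children_b):
    """Assumed measure: share of pages linked from either document
    that are linked from both (Jaccard ratio over out-link sets)."""
    union = children_a | children_b
    return len(children_a & children_b) / len(union) if union else 0.0


def noisy_or(p_link, p_content):
    """Assumed combination rule: treat the two scores in [0, 1] as
    independent evidence; the result is high if either score is high."""
    return 1.0 - (1.0 - p_link) * (1.0 - p_content)


def f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical in-link sets for two pages.
    in_a, in_b = {"x", "y"}, {"y", "z"}
    link_score = cocitation(in_a, in_b)
    print(link_score)
    print(noisy_or(link_score, 0.5))
    print(f1(tp=8, fp=2, fn=2))
```

A noisy-OR rule lets a strong link signal dominate a weak content signal, which matches the abstract's remark that page text is the less reliable source; whether the paper uses this exact rule is an assumption here.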


