ABSTRACT
This paper studies how link information can be used to improve classification results for Web collections. We evaluate four measures of subject similarity derived from the Web link structure and determine how accurately each predicts document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone yields gains of up to 46 points in F1 over a traditional content-based classifier. Combining it with content-based methods can further improve the results, but may also introduce too much noise, since the text of Web pages is a much less reliable source of information. This work provides insight into which link-derived measures are most appropriate for comparing Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
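
To make the idea of a link-derived similarity measure concrete, the sketch below scores two pages by co-citation, that is, by how many pages link to both of them. The abstract does not name its four measures, so the choice of co-citation, the Jaccard-style normalization, the summed-similarity voting rule, and the toy graph are all assumptions made for illustration; this is not the paper's Bayesian network model.

```python
from collections import defaultdict


def cocitation(inlinks, a, b):
    """Assumed measure: shared in-linking pages over all in-linking pages."""
    parents_a = inlinks.get(a, set())
    parents_b = inlinks.get(b, set())
    union = parents_a | parents_b
    return len(parents_a & parents_b) / len(union) if union else 0.0


def predict_category(inlinks, labels, page):
    """Pick the category whose labeled pages are most similar to `page`."""
    scores = defaultdict(float)
    for other, category in labels.items():
        if other != page:
            scores[category] += cocitation(inlinks, page, other)
    return max(scores, key=scores.get) if scores else None


# Hypothetical toy graph: page -> set of pages that link to it.
inlinks = {
    "p1": {"hub_a", "hub_b"},
    "p2": {"hub_a", "hub_b", "hub_c"},
    "p3": {"hub_d"},
    "new": {"hub_a", "hub_b"},
}
# Hypothetical category labels for the already classified pages.
labels = {"p1": "sports", "p2": "sports", "p3": "music"}

prediction = predict_category(inlinks, labels, "new")
```

In this toy graph the unlabeled page shares its in-links only with the two "sports" pages, so the vote goes to that category. The in-link sets may include pages outside the labeled collection, which is the setting the abstract reports as giving the best results.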