| A comparative study of citations and links in document classification |
| Source
|
International Conference on Digital Libraries
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Chapel Hill, NC, USA
SESSION: Classification and links
Pages: 75 - 84
Year of Publication: 2006
ISBN:1-59593-354-9
|
|
Authors
|
|
Thierson Couto
|
University of Minas Gerais, Belo Horizonte, Brazil
|
|
Marco Cristo
|
University of Minas Gerais, Belo Horizonte, Brazil
|
|
Marcos André Gonçalves
|
University of Minas Gerais, Belo Horizonte, Brazil
|
|
Pável Calado
|
IST/INESC-ID, Lisboa, Portugal
|
|
Nivio Ziviani
|
University of Minas Gerais, Belo Horizonte, Brazil
|
|
Edleno Moura
|
Federal University of Amazonas, Manaus, Brazil
|
|
Berthier Ribeiro-Neto
|
Federal University of Minas Gerais, Belo Horizonte, Brazil and Google Engineering, Belo Horizonte, Brazil
|
|
ABSTRACT
It is well known that links are an important source of information when dealing with Web collections. However, the question remains whether the same techniques that are used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains of up to 37% over text-based classifiers, while measures based on bibliographic coupling perform better in the digital library. We also propose a simple and effective way of combining a traditional text-based classifier with a citation-link-based classifier. This combination is based on the notion of classifier reliability and yielded gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes of these results. We discovered that the documents misclassified by the citation-link-based classifiers are in fact difficult cases, hard to classify even for humans.
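To make the two measures named in the abstract concrete, the sketch below computes raw co-citation and bibliographic coupling counts on a toy citation graph. It is only an illustration under stated assumptions: the function names, the `cites` mapping, and the document identifiers are hypothetical, and the paper may well use normalized variants of these counts rather than the raw intersections shown here.

```python
from collections import defaultdict

def invert(cites):
    """Map each document to the set of documents that cite (or link to) it."""
    cited_by = defaultdict(set)
    for src, targets in cites.items():
        for dst in targets:
            cited_by[dst].add(src)
    return cited_by

def cocitation(cites, a, b):
    """Raw co-citation count: how many documents cite both a and b."""
    cited_by = invert(cites)
    return len(cited_by[a] & cited_by[b])

def bibliographic_coupling(cites, a, b):
    """Raw coupling count: how many documents are cited by both a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))

# Hypothetical toy graph: keys cite (or link to) the documents in their sets.
cites = {
    "p1": {"p3", "p4"},
    "p2": {"p3", "p4", "p5"},
    "p6": {"p1", "p2"},
    "p7": {"p1", "p2"},
    "p8": {"p1"},
}

print(cocitation(cites, "p1", "p2"))              # p6 and p7 cite both
print(bibliographic_coupling(cites, "p1", "p2"))  # both cite p3 and p4
```

Co-citation depends on incoming citations, so it is only informative for documents that others already point to, whereas bibliographic coupling depends on a document's own outgoing references. That asymmetry is one plausible reason the two measures could behave differently on a Web directory and on a digital library.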
CITED BY 2
|
|
Eli Cortez , Altigran S. da Silva , Marcos André Gonçalves , Filipe Mesquita , Edleno S. de Moura, FLUX-CIM: flexible unsupervised extraction of citation metadata, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
|
|
|
José A. Camacho-Guerrero , Alex A. Carvalho , Maria G. C. Pimentel , Ethan V. Munson , Alessandra A. Macedo, Clustering as an approach to support the automatic definition of semantic hyperlinks, Proceedings of the eighteenth conference on Hypertext and hypermedia, September 10-12, 2007, Manchester, UK
|
|