| A probabilistic description-oriented approach for categorizing web documents |
| Source |
Conference on Information and Knowledge Management
Proceedings of the eighth international conference on Information and knowledge management
Kansas City, Missouri, United States
Pages: 475-482
Year of Publication: 1999
ISBN: 1-58113-146-1
| Authors |
Norbert Gövert (University of Dortmund)
Mounia Lalmas (Department of Computer Science, Queen Mary & Westfield College, University of London and University of Dortmund)
Norbert Fuhr (University of Dortmund)
ABSTRACT
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We are facing a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers.

Our categorisation approach is based on a probabilistic description-oriented representation of web documents, and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier.

Experimental results show that (1) using an enhanced representation of web documents is crucial for an effective categorisation of web documents, and (2) a theoretical interpretation of the k-nearest neighbour classifier gives an improvement over the standard k-nearest neighbour classifier.
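For orientation, the standard k-nearest neighbour classifier that the abstract uses as its baseline can be sketched roughly as follows. This is a generic illustration in Python, not the authors' probabilistic variant or their document representation: the function names (`cosine`, `knn_scores`), the plain bag-of-words term-frequency vectors, the toy training data, and the normalisation of votes by the summed similarity are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(query, training, k=3):
    """Score each category by similarity-weighted votes of the k nearest documents.

    `training` is a list of (text, set_of_categories) pairs (hypothetical format).
    Scores are divided by the summed similarity of the neighbours, so they lie
    in [0, 1]; this normalisation is an illustrative assumption.
    """
    q = Counter(query.lower().split())
    sims = [(cosine(q, Counter(text.lower().split())), cats)
            for text, cats in training]
    # Keep the k training documents most similar to the query.
    neighbours = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    scores = defaultdict(float)
    for sim, cats in neighbours:
        for cat in cats:
            scores[cat] += sim
    if total > 0:
        for cat in scores:
            scores[cat] /= total
    return dict(scores)


# Toy, made-up training set used only to show the call.
TRAINING = [
    ("football match goal", {"sport"}),
    ("football league team", {"sport"}),
    ("stock market shares", {"finance"}),
]

print(knn_scores("football goal", TRAINING, k=2))
```

In this sketch the choice of `k`, the similarity function, and the vote normalisation are fixed by hand. These are the kinds of parameters for which the abstract says the probabilistic interpretation supplies a justification.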
CITED BY 10
Baoping Zhang, Marcos André Gonçalves, Weiguo Fan, Yuxin Chen, Edward A. Fox, Pável Calado, Marco Cristo, Combining structural and citation-based evidence for text classification, Proceedings of the thirteenth ACM international conference on Information and knowledge management, November 08-13, 2004, Washington, D.C., USA
Pável Calado, Marco Cristo, Edleno Moura, Nivio Ziviani, Berthier Ribeiro-Neto, Marcos André Gonçalves, Combining link-based and content-based methods for web document classification, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA