A probabilistic description-oriented approach for categorizing web documents
Source: Conference on Information and Knowledge Management
Proceedings of the eighth international conference on Information and knowledge management
Kansas City, Missouri, United States
Pages: 475 - 482  
Year of Publication: 1999
ISBN:1-58113-146-1
Authors
Norbert Gövert  University of Dortmund
Mounia Lalmas  Department of Computer Science, Queen Mary & Westfield College, University of London and University of Dortmund
Norbert Fuhr  University of Dortmund
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGIR: ACM Special Interest Group on Information Retrieval
SIGMIS: ACM Special Interest Group on Management Information Systems
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/319950.320053

ABSTRACT

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We are facing a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers.

Our categorisation approach is based on a probabilistic description-oriented representation of web documents, and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier.

Experimental results show that (1) using an enhanced representation of web documents is crucial for an effective categorisation of web documents, and (2) a theoretical interpretation of the k-nearest neighbour classifier improves on the standard k-nearest neighbour classifier.
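For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of a standard similarity-weighted k-nearest neighbour text categoriser. It is not the probabilistic interpretation or the description-oriented representation proposed in the paper, whose details are not given on this page. The bag-of-words representation, the cosine similarity, the function names `cosine` and `knn_scores`, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_scores(query, training, k=3):
    """Score each category by the similarity-weighted votes of the k nearest
    training documents, normalised by the total similarity of those k.

    `training` is a list of (text, set_of_categories) pairs (hypothetical format).
    """
    q = Counter(query.lower().split())
    ranked = sorted(
        ((cosine(q, Counter(text.lower().split())), cats) for text, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    scores = {}
    if total == 0.0:
        return scores  # no neighbour shares a term with the query
    for sim, cats in ranked:
        for c in cats:
            scores[c] = scores.get(c, 0.0) + sim / total
    return scores


# Toy, made-up training set: each document carries one category.
training = [
    ("football match score goal", {"sport"}),
    ("tennis match player serve", {"sport"}),
    ("stock market shares price", {"finance"}),
    ("bank interest rate price", {"finance"}),
]

if __name__ == "__main__":
    print(knn_scores("football player goal", training, k=2))
```

In this sketch the choice of k, the similarity function, and the normalisation are ad hoc; the abstract states that the paper's contribution is a theoretically sound justification for such parameters.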

