ABSTRACT
On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and does not scale to the increasing volumes of data that must be processed. Several approaches to effective automatic text categorisation have been investigated. These include the choice of categorisation method, the selection of pre-categorised training samples, the use of hierarchies, and the selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher-level categories when lower-level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.
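The idea of shifting an uncertain assignment to a higher-level category can be illustrated with a minimal sketch. This is not the method evaluated in the paper: the abstract does not state how uncertainty is measured, so the per-category scores, the fixed `threshold`, the toy `PARENT` hierarchy, and the function name `assign_with_backoff` are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy: category -> parent category (None at the top level).
PARENT: Dict[str, Optional[str]] = {
    "Science": None,
    "Science/Biology": "Science",
    "Science/Biology/Genetics": "Science/Biology",
}


def assign_with_backoff(
    scores: Dict[str, float], leaf: str, threshold: float = 0.5
) -> Optional[str]:
    """Return the lowest category on the path from `leaf` to the root
    whose score meets `threshold`, or None if no category qualifies.

    `scores` is assumed to hold one classifier confidence per category;
    a missing category is treated as having a score of 0.0.
    """
    category: Optional[str] = leaf
    while category is not None:
        if scores.get(category, 0.0) >= threshold:
            return category
        # Assignment at this level is uncertain: move one level up.
        category = PARENT.get(category)
    return None


if __name__ == "__main__":
    example_scores = {
        "Science": 0.9,
        "Science/Biology": 0.6,
        "Science/Biology/Genetics": 0.3,
    }
    print(assign_with_backoff(example_scores, "Science/Biology/Genetics"))
```

In this sketch a document whose best leaf category scores below the threshold is assigned to the nearest ancestor that clears it, trading specificity for a lower chance of a wrong low-level assignment.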