Machine learning in automated text categorization
Source: ACM Computing Surveys (CSUR)
Volume 34, Issue 1 (March 2002)
Pages: 1-47
Year of Publication: 2002
ISSN: 0360-0300
Author: Fabrizio Sebastiani, Consiglio Nazionale delle Ricerche, Pisa, Italy
Publisher: ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/505282.505283

ABSTRACT

The automated categorization (or classification) of texts into predefined categories has attracted booming interest over the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
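The three problems named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the survey: the bag-of-words representation, the multinomial naive Bayes learner with add-one smoothing, and the precision/recall measures are one simple choice among the many a survey of this area would cover, and the toy documents, category names, and function names below are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: a bag of lowercase, whitespace-split terms."""
    return text.lower().split()


def train(labeled_docs):
    """Classifier construction: collect the counts naive Bayes needs
    from a set of preclassified (text, category) pairs."""
    doc_counts = Counter()              # documents seen per category
    term_counts = defaultdict(Counter)  # term frequencies per category
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for term in tokenize(text):
            term_counts[category][term] += 1
            vocab.add(term)
    return doc_counts, term_counts, vocab


def classify(model, text):
    """Return the category with the highest smoothed log-probability."""
    doc_counts, term_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        score = math.log(doc_counts[category] / total_docs)
        total_terms = sum(term_counts[category].values())
        for term in tokenize(text):
            if term not in vocab:
                continue  # terms never seen in training carry no evidence
            count = term_counts[category][term]
            score += math.log((count + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def precision_recall(model, test_docs, category):
    """Classifier evaluation: precision and recall for one category."""
    tp = fp = fn = 0
    for text, truth in test_docs:
        predicted = classify(model, text)
        if predicted == category and truth == category:
            tp += 1
        elif predicted == category:
            fp += 1
        elif truth == category:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy corpus, invented purely for illustration.
TRAIN = [
    ("wheat harvest grain prices rise", "grain"),
    ("corn and wheat exports fall", "grain"),
    ("bank raises interest rates", "finance"),
    ("central bank cuts interest rates", "finance"),
]
TEST = [
    ("wheat exports rise", "grain"),
    ("bank interest rates fall", "finance"),
]

model = train(TRAIN)
print(classify(model, "wheat exports rise"))
print(precision_recall(model, TEST, "grain"))
```

The sketch keeps training and test documents separate, since effectiveness measured on the documents used to build the classifier would overstate how well it generalizes.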

