|
ABSTRACT
In many text classification applications, it is appealing to take every document as a string of characters rather than a bag of words. Previous research studies in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix tree based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A. N. Aizawa. Linguistic techniques to improve the performance of automatic text categorization. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposim (NLPRS), pages 307--314, Tokyo, Japan, 2001.
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
A. Bratko and B. Filipic. Spam filtering using compression models. Technical Report IJS-DP-9227, Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia, 2005.
|
| |
6
|
|
| |
7
|
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
|
 |
8
|
|
| |
9
|
W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327--344, 1994.
|
| |
10
|
|
 |
11
|
|
| |
12
|
Kenneth Ward Church , Patrick Hanks, Word association norms, mutual information, and lexicography, Proceedings of the 27th annual meeting on Association for Computational Linguistics, p.76-83, June 26-29, 1989, Vancouver, British Columbia, Canada
[doi> 10.3115/981623.981633]
|
| |
13
|
J. G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. The Comput Journal, 40(2-3):67--75, 1997.
|
| |
14
|
J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communication, 32(4):396--402, 1984.
|
| |
15
|
|
 |
16
|
Douglass R. Cutting , David R. Karger , Jan O. Pedersen , John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark
[doi> 10.1145/133160.133214]
|
 |
17
|
Susan Dumais , John Platt , David Heckerman , Mehran Sahami, Inductive learning algorithms and representations for text categorization, Proceedings of the seventh international conference on Information and knowledge management, p.148-155, November 02-07, 1998, Bethesda, Maryland, United States
[doi> 10.1145/288627.288651]
|
| |
18
|
|
 |
19
|
|
| |
20
|
J. Goodman. A bit of progress in language modeling, extended version. Technical report, Microsoft Research, 2001.
|
| |
21
|
|
| |
22
|
J. He, A.-H. Tan, and C.-L. Tan. A comparative study on Chinese text categorization methods. In PRICAI'2000 International Workshop on Text and Web Mining, pages 24--35, Melbourne, Australia, 2000.
|
| |
23
|
|
| |
24
|
|
| |
25
|
D. Holmes and R. Forsyth. The federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2):111--127, 1995.
|
| |
26
|
C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415--425, 2002.
|
| |
27
|
P. Jackson and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. John Benjamins Publishing Co, 2002.
|
| |
28
|
|
| |
29
|
|
 |
30
|
|
| |
31
|
|
| |
32
|
|
| |
33
|
|
| |
34
|
|
| |
35
|
John Lafferty , Guy Lebanon, Diffusion Kernels on Statistical Manifolds, The Journal of Machine Learning Research, 6, p.129-163, 9/1/2005
|
 |
36
|
John Lafferty , Chengxiang Zhai, Document language models, query models, and risk minimization for information retrieval, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.111-119, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383970]
|
 |
37
|
|
 |
38
|
|
| |
39
|
C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the 7th Pacific Symposium on Biocomputing (PSB), pages 566--575, Lihue, HI, 2002.
|
| |
40
|
|
| |
41
|
|
| |
42
|
|
| |
43
|
Y. Marton, N. Wu, and L. Hellerstein. On compression-based text classification. In Proceedings of the 27th European Conference on IR Research (ECIR), pages 300--314, Santiago de Compostela, Spain, 2005.
|
| |
44
|
|
| |
45
|
R. M. Pampapathi, B. Mirkin, and M. Levene. A suffix tree approach to email filtering, 2005.
|
| |
46
|
A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw Hill, New York, 2nd edition, 1984.
|
| |
47
|
|
 |
48
|
|
| |
49
|
|
| |
50
|
R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270--1278, 2000.
|
| |
51
|
R. E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2002.
|
| |
52
|
|
| |
53
|
B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
|
 |
54
|
|
| |
55
|
|
| |
56
|
|
| |
57
|
|
| |
58
|
W. J. Teahan and D. J. Harper. Using compression based language models for text categorization. In J. Callan, B. Croft, and J. Lafferty, editors, Workshop on Language Modeling and Information Retrieval, pages 83--88, Carnegie Mellon University, 2001.
|
| |
59
|
E. Ukkonen. On-line construction of suffix-trees. Algorithmica, 14:249--260, 1995.
|
| |
60
|
|
| |
61
|
|
| |
62
|
S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Scholkopf, and J. Vert, editors, Kernels and Bioinformatics. MIT Press, Cambridge, MA, 2004.
|
| |
63
|
I. H. Witten. Applications of lossless compression in adaptive text mining. In Proceedings of the 34th Annual Conference on Information Sciences and Systems (CISS), Princeton University, New Jersey, 2000.
|
| |
64
|
|
| |
65
|
|
| |
66
|
J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. In Proceedings of the 19th International Conference on Data Engineering (ICDE), pages 101--112, Bangalore, India, 2003.
|
 |
67
|
|
| |
68
|
|
| |
69
|
|
| |
70
|
C. Zhai. Risk Minimization and Language Modeling in Information Retrieval. PhD thesis, Carnegie Mellon University, 2002.
|
 |
71
|
|
 |
72
|
|
| |
73
|
D. Zhang and Y. Dong. Semantic, hierarchical, online clustering of web search results. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, 2004.
|
| |
74
|
G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.
|
|