ACM Home Page
Please provide us with feedback. Feedback
Extracting key-substring-group features for text classification
Full text PdfPdf (766 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Philadelphia, PA, USA
SESSION: Research track papers table of contents
Pages: 474 - 483  
Year of Publication: 2006
ISBN:1-59593-339-5
Authors
Dell Zhang  University of London, London, UK
Wee Sun Lee  National University of Singapore, Singapore
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 25,   Downloads (12 Months): 183,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1150402.1150455
What is a DOI?

ABSTRACT

In many text classification applications, it is appealing to take every document as a string of characters rather than a bag of words. Previous research studies in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix tree based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
A. N. Aizawa. Linguistic techniques to improve the performance of automatic text categorization. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposim (NLPRS), pages 307--314, Tokyo, Japan, 2001.
 
2
 
3
 
4
 
5
A. Bratko and B. Filipic. Spam filtering using compression models. Technical Report IJS-DP-9227, Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia, 2005.
 
6
 
7
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
8
 
9
W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327--344, 1994.
 
10
11
 
12
 
13
J. G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. The Comput Journal, 40(2-3):67--75, 1997.
 
14
J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communication, 32(4):396--402, 1984.
 
15
16
17
 
18
19
 
20
J. Goodman. A bit of progress in language modeling, extended version. Technical report, Microsoft Research, 2001.
 
21
 
22
J. He, A.-H. Tan, and C.-L. Tan. A comparative study on Chinese text categorization methods. In PRICAI'2000 International Workshop on Text and Web Mining, pages 24--35, Melbourne, Australia, 2000.
 
23
 
24
 
25
D. Holmes and R. Forsyth. The federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2):111--127, 1995.
 
26
C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415--425, 2002.
 
27
P. Jackson and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. John Benjamins Publishing Co, 2002.
 
28
 
29
30
 
31
 
32
 
33
 
34
 
35
John Lafferty , Guy Lebanon, Diffusion Kernels on Statistical Manifolds, The Journal of Machine Learning Research, 6, p.129-163, 9/1/2005
36
37
38
 
39
C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the 7th Pacific Symposium on Biocomputing (PSB), pages 566--575, Lihue, HI, 2002.
 
40
 
41
 
42
 
43
Y. Marton, N. Wu, and L. Hellerstein. On compression-based text classification. In Proceedings of the 27th European Conference on IR Research (ECIR), pages 300--314, Santiago de Compostela, Spain, 2005.
 
44
 
45
R. M. Pampapathi, B. Mirkin, and M. Levene. A suffix tree approach to email filtering, 2005.
 
46
A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw Hill, New York, 2nd edition, 1984.
 
47
48
 
49
 
50
R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270--1278, 2000.
 
51
R. E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2002.
 
52
 
53
B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
54
 
55
 
56
 
57
 
58
W. J. Teahan and D. J. Harper. Using compression based language models for text categorization. In J. Callan, B. Croft, and J. Lafferty, editors, Workshop on Language Modeling and Information Retrieval, pages 83--88, Carnegie Mellon University, 2001.
 
59
E. Ukkonen. On-line construction of suffix-trees. Algorithmica, 14:249--260, 1995.
 
60
 
61
 
62
S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Scholkopf, and J. Vert, editors, Kernels and Bioinformatics. MIT Press, Cambridge, MA, 2004.
 
63
I. H. Witten. Applications of lossless compression in adaptive text mining. In Proceedings of the 34th Annual Conference on Information Sciences and Systems (CISS), Princeton University, New Jersey, 2000.
 
64
 
65
 
66
J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. In Proceedings of the 19th International Conference on Data Engineering (ICDE), pages 101--112, Bangalore, India, 2003.
67
 
68
 
69
 
70
C. Zhai. Risk Minimization and Language Modeling in Information Retrieval. PhD thesis, Carnegie Mellon University, 2002.
71
72
 
73
D. Zhang and Y. Dong. Semantic, hierarchical, online clustering of web search results. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, 2004.
 
74
G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.