ABSTRACT
A common representation used in text categorization is the bag-of-words model (also known as the unigram model). Learning with this representation typically involves some preprocessing, e.g., stopword removal and stemming, which results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach in which learning involves automatic tokenization. This allows us to weaken the a priori knowledge required about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach that chooses the maximum gradient ascent direction projected onto a single dimension (i.e., candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization, such as cyclic coordinate descent logistic regression and support vector machines.
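To make the idea of gradient ascent projected onto a single n-gram dimension concrete, the following is a minimal sketch and not the implementation described in this work. It assumes binary presence features over character n-grams, a fixed step size, and 0/1 labels, and it finds the best coordinate by brute-force enumeration of all n-grams up to a maximum length; the branch-and-bound search that makes this efficient in the full n-gram space is not reproduced here. All names (`char_ngrams`, `fit`, `predict_proba`, `max_n`, `step_size`) are hypothetical.

```python
import math


def char_ngrams(text, max_n):
    """Return the set of character n-grams of length 1..max_n found in text."""
    return {text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(docs, labels, max_n=3, steps=50, step_size=0.5):
    """Greedy coordinate-wise gradient ascent on the logistic log-likelihood.

    Assumption: brute-force enumeration stands in for branch and bound.
    Each step updates only the n-gram whose partial derivative has the
    largest magnitude. Labels are 0 or 1.
    """
    feats = [char_ngrams(d, max_n) for d in docs]
    weights = {}
    for _ in range(steps):
        # Residuals y_i - p_i under the current model.
        resid = [y - sigmoid(sum(weights.get(g, 0.0) for g in f))
                 for f, y in zip(feats, labels)]
        # Partial derivative for n-gram g: sum of residuals of docs containing g.
        grad = {}
        for f, r in zip(feats, resid):
            for g in f:
                grad[g] = grad.get(g, 0.0) + r
        if not grad:
            break
        best = max(grad, key=lambda g: abs(grad[g]))
        if abs(grad[best]) < 1e-9:
            break
        weights[best] = weights.get(best, 0.0) + step_size * grad[best]
    return weights


def predict_proba(weights, text, max_n=3):
    """Probability of the positive class for a new text."""
    return sigmoid(sum(weights.get(g, 0.0) for g in char_ngrams(text, max_n)))


if __name__ == "__main__":
    docs = ["good movie", "great film", "bad movie", "awful film"]
    labels = [1, 1, 0, 0]
    # After one step only the single steepest n-gram has a nonzero weight.
    print(fit(docs, labels, steps=1))  # {'g': 0.5}
```

In this toy corpus the first selected feature is the character unigram `g`, the only n-gram that occurs in both positive documents and in neither negative one. The resulting model is sparse because a weight is created only for features that were chosen at some step.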