Fast logistic regression for text categorization with variable-length n-grams
Source
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
Session: Research papers
Pages: 354-362
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Georgiana Ifrim  Max-Planck Institute for Informatics, Saarbrücken, Germany
Gökhan Bakir  Google Switzerland GmbH, Zürich, Switzerland
Gerhard Weikum  Max-Planck Institute for Informatics, Saarbrücken, Germany
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401936

ABSTRACT

A common representation used in text categorization is the bag-of-words model (a.k.a. the unigram model). Learning with this representation typically involves some preprocessing, e.g., stop-word removal and stemming. This results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach where learning involves automatic tokenization. This allows us to weaken the a priori required knowledge about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach which chooses the maximum gradient ascent direction projected onto a single dimension (i.e., candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization such as cyclic coordinate descent logistic regression and support vector machines.
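
The idea of gradient ascent restricted to a single n-gram coordinate per step can be illustrated with a small sketch. This is not the authors' implementation: it enumerates every character n-gram by brute force instead of using the branch-and-bound search the abstract describes, and the fixed step size, the binary presence features, the tie-breaking rule, and all function names (char_ngrams, predict, train) are assumptions made purely for illustration.

```python
import math


def char_ngrams(text, max_n):
    """Return the set of all character n-grams of length 1..max_n in text."""
    return {text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)}


def predict(weights, text, max_n=3):
    """Probability of the positive class under the sparse n-gram model."""
    score = sum(weights.get(g, 0.0) for g in char_ngrams(text, max_n))
    return 1.0 / (1.0 + math.exp(-score))


def train(docs, labels, max_n=3, iterations=20, step=0.5):
    """Greedy coordinate-wise gradient ascent on the logistic log-likelihood.

    labels are 0/1. Each iteration updates only the single n-gram whose
    partial derivative has the largest magnitude (assumed fixed step size).
    """
    feats = [char_ngrams(d, max_n) for d in docs]
    weights = {}
    for _ in range(iterations):
        # Partial derivative for n-gram g: sum of (y_i - p_i) over the
        # documents that contain g (binary presence features).
        grad = {}
        for f, y in zip(feats, labels):
            score = sum(weights.get(g, 0.0) for g in f)
            residual = y - 1.0 / (1.0 + math.exp(-score))
            for g in f:
                grad[g] = grad.get(g, 0.0) + residual
        # Brute-force search for the steepest coordinate; ties are broken
        # lexicographically so that the result is deterministic.
        best = max(grad, key=lambda g: (abs(grad[g]), g))
        weights[best] = weights.get(best, 0.0) + step * grad[best]
    return weights
```

Calling train(["good movie", "great good fun", "bad movie", "awful bad plot"], [1, 1, 0, 0]) returns a dictionary holding weights for only a handful of n-grams, since at most one coordinate is touched per iteration. The brute-force scan over all n-grams is what a branch-and-bound search would replace when the n-gram length is unbounded.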


