Extremely fast text feature extraction for classification and indexing
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: KM: text mining
Pages 1221-1230
Year of Publication: 2008
ISBN: 978-1-59593-991-3
Authors
George Forman, Hewlett-Packard Labs, Palo Alto, CA, USA
Evan Kirshenbaum, Hewlett-Packard Labs, Palo Alto, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458243

ABSTRACT

Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large-scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
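
The single-pass idea described in the abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function name `hash_features`, the FNV-1a-style mixing constants, and the use of `str.isalnum` as the word-boundary test are all assumptions made for brevity. Because Python strings are already decoded, the Unicode conversion step is left to Python's decoder here rather than folded into the loop.

```python
from collections import Counter
from typing import Iterator

# Assumed hash parameters (32-bit FNV-1a style); the paper's own hash may differ.
MASK32 = 0xFFFFFFFF
HASH_SEED = 2166136261
HASH_PRIME = 16777619


def hash_features(text: str) -> Iterator[int]:
    """Yield one 32-bit integer per word in a single pass over `text`.

    Lowercasing, word-boundary detection, and hashing happen in the same
    loop, so no intermediate word strings are ever built.
    """
    h = HASH_SEED
    in_word = False
    for ch in text:
        if ch.isalnum():
            # Fold the lowercased character(s) into the running hash.
            # str.lower() can return more than one character for some inputs.
            for c in ch.lower():
                h = ((h ^ ord(c)) * HASH_PRIME) & MASK32
            in_word = True
        elif in_word:
            # A non-word character ends the current word: emit its hash.
            yield h
            h = HASH_SEED
            in_word = False
    if in_word:
        # Emit the final word if the text does not end on a boundary.
        yield h


# Usage: count integer features instead of string tokens.
counts = Counter(hash_features("Fast text, FAST Text"))
```

In this sketch the differently cased spellings of a word map to the same integer, so `counts` holds integer keys where a conventional tokenizer would hold lowercase strings. A classifier or index could consume these integers directly, or reduce them modulo a table size, which is a design choice this sketch leaves open.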


