Real-time data pre-processing technique for efficient feature extraction in large scale datasets
Source
Conference on Information and Knowledge Management
Proceedings of the 17th ACM Conference on Information and Knowledge Management
Napa Valley, California, USA
SESSION: KM: feature selection
Pages 981-990
Year of Publication: 2008
ISBN:978-1-59593-991-3
Authors
Ying Liu  The Pennsylvania State University, University Park, PA, USA
Lucian V. Lita  Siemens Medical Solutions, Malvern, PA, USA
R. Stefan Niculescu  Siemens Medical Solutions, Malvern, PA, USA
Kun Bai  The Pennsylvania State University, University Park, PA, USA
Prasenjit Mitra  The Pennsylvania State University, University Park, PA, USA
C. Lee Giles  The Pennsylvania State University, University Park, PA, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458211

ABSTRACT

Because domain-specific data sources keep growing rapidly, there is a sustained need for fast processing in time-sensitive applications, such as medical record information extraction at the point of care and genetic feature extraction for personalized treatment, as well as in off-line knowledge discovery, such as creating evidence-based medicine. Since parallel multi-string matching is at the core of most data mining tasks in these applications, faster on-line matching in static and streaming data is needed to improve the overall efficiency of such knowledge discovery. To address this data mining need, which traditional information extraction and retrieval techniques do not handle efficiently, we propose a Block Suffix Shifting-based approach, which improves on state-of-the-art multi-string matching algorithms such as Aho-Corasick, Commentz-Walter, and Wu-Manber. The strength of our approach is its ability to exploit the different block structures of domain-specific data for off-line and on-line parallel matching. Experiments on several real-world datasets show how our approach translates into significant performance improvements.
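
For readers unfamiliar with the multi-string matching task the abstract refers to, the following is a minimal sketch of the classic Aho-Corasick automaton, one of the baseline algorithms named above. It is not the Block Suffix Shifting approach proposed in the paper, whose details are not given in this abstract. The function names build_automaton and find_all are illustrative assumptions, not taken from the paper.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a set of patterns (illustrative helper)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for p in patterns:
        if not p:
            continue  # skip empty patterns
        node = 0
        for ch in p:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(p)
    # Breadth-first pass: depth-1 nodes keep fail = 0; deeper nodes follow
    # the parent's failure chain until a node with a matching edge is found.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the patterns recognised at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            matches.append((i - len(p) + 1, p))
    return matches


if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

The text is scanned once regardless of how many patterns are searched for, which is why automaton-based matchers of this kind serve as a common baseline for the workloads described in the abstract.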


