| Real-time data pre-processing technique for efficient feature extraction in large scale datasets |
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: KM: feature selection
Pages 981-990
Year of Publication: 2008
ISBN: 978-1-59593-991-3
Authors
Ying Liu, The Pennsylvania State University, University Park, PA, USA
Lucian V. Lita, Siemens Medical Solutions, Malvern, PA, USA
R. Stefan Niculescu, Siemens Medical Solutions, Malvern, PA, USA
Kun Bai, The Pennsylvania State University, University Park, PA, USA
Prasenjit Mitra, The Pennsylvania State University, University Park, PA, USA
C. Lee Giles, The Pennsylvania State University, University Park, PA, USA
ABSTRACT
Due to the continuous and rapid growth of domain-specific data sources, there is a sustained need for fast processing in time-sensitive applications, such as medical record information extraction at the point of care and genetic feature extraction for personalized treatment, as well as in off-line knowledge discovery, such as creating evidence-based medicine. Since parallel multi-string matching is at the core of most data mining tasks in these applications, faster on-line matching in static and streaming data is needed to improve the overall efficiency of such knowledge discovery. To address this data mining need, which traditional information extraction and retrieval techniques do not handle efficiently, we propose a Block Suffix Shifting-based approach, which improves on state-of-the-art multi-string matching algorithms such as Aho-Corasick, Commentz-Walter, and Wu-Manber. The strength of our approach is its ability to exploit the different block structures of domain-specific data for off-line and on-line parallel matching. Experiments on several real-world datasets show how our approach translates into significant performance improvements.