ABSTRACT
In recent years, technological advances in gene mapping have made it increasingly easy to store and use a wide variety of biological data. Such data usually take the form of very long strings, for which it is difficult to determine the features most relevant to a classification task. For example, a typical DNA string may be millions of characters long, and a database may contain thousands of such strings. In many cases, the classification behavior of the data is hidden in the compositional behavior of certain segments of the string, which cannot easily be determined a priori. A further complication is that in some cases the classification behavior is reflected in the global behavior of the string, whereas in others it is reflected in local patterns. Given the enormous variation in the behavior of strings across different data sets, it is useful to develop a classification approach that is sensitive to both the global and the local behavior of the strings. To this end, we exploit the multi-resolution property of wavelet decomposition to create a scheme that can mine classification characteristics at different levels of granularity. The resulting scheme turns out to be very effective in practice on a wide range of problems.
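To make the multi-resolution idea concrete, the following is a minimal sketch of a Haar wavelet decomposition applied to a numerically encoded string. It is an illustration only: the abstract does not say which wavelet basis or which character encoding is used, so the unnormalized Haar transform, the `ENCODING` table, and the function names `encode` and `haar_decompose` are all assumptions introduced here.

```python
# Hypothetical numeric encoding of nucleotides; the abstract does not specify one.
ENCODING = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def encode(sequence):
    """Map each character of a DNA string to a number (assumed encoding)."""
    return [ENCODING[ch] for ch in sequence]


def haar_decompose(values):
    """Unnormalized Haar decomposition of a power-of-two-length signal.

    Returns (overall_average, details), where details[0] holds the finest
    (most local) detail coefficients and details[-1] the coarsest (most global).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = list(values)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        # Pairwise averages form the coarser view of the signal.
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        # Pairwise half-differences record what the averages discard.
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details.append(diffs)
        current = averages
    return current[0], details


average, details = haar_decompose(encode("AACCGGTT"))
for level, coefficients in enumerate(details, start=1):
    print(f"level {level}: {coefficients}")
print(f"overall average: {average}")
```

Coefficients at the finest level respond to differences between adjacent characters (local patterns), while coefficients at coarser levels summarize differences between larger segments (global behavior). A classifier could in principle draw features from any of these levels, which is the sense in which a wavelet decomposition exposes several levels of granularity at once.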