ABSTRACT
Distance function computation is a key subtask in many data mining algorithms and applications. The most effective form of a distance function can only be expressed in the context of a particular data domain, and finding that form is often a challenging, non-trivial task. For example, in the text domain, distance function design has been considered so important and complex that it has been the focus of intensive research over three decades. The final design of distance functions in this domain has been reached only through detailed empirical testing and consensus on the quality of the results provided by the different variations. With the increasing ability to collect data in an automated way, the number of new kinds of data continues to grow rapidly, which makes it increasingly difficult to undertake such efforts for every new data type. Because a human is the end user of any application, the most important aspect of distance function design is that it must satisfy the user's requirements with regard to effectiveness. This creates the need for a systematic framework for designing distance functions that are sensitive to the particular characteristics of the data domain. In this paper, we discuss such a framework. The goal is to create distance functions in an automated way while minimizing the work required from the user. We will show that this framework creates distance functions that are significantly more effective than popularly used functions such as the Euclidean metric.