Towards systematic design of distance functions for data mining applications
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Washington, D.C.
SESSION: Research track
Pages: 9 - 18  
Year of Publication: 2003
ISBN: 1-58113-737-0
Author
Charu C. Aggarwal, IBM T. J. Watson Research Center, Hawthorne, NY
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/956750.956756

ABSTRACT

Distance function computation is a key subtask in many data mining algorithms and applications. The most effective form of the distance function can only be expressed in the context of a particular data domain. It is also often a challenging and non-trivial task to find the most effective form of the distance function. For example, in the text domain, distance function design has been considered such an important and complex issue that it has been the focus of intensive research over three decades. The final design of distance functions in this domain has been reached only by detailed empirical testing and consensus over the quality of results provided by the different variations. With the increasing ability to collect data in an automated way, the number of new kinds of data continues to increase rapidly. This makes it increasingly difficult to undertake such efforts for each and every new data type. The most important aspect of distance function design is that since a human is the end-user for any application, the design must satisfy the user requirements with regard to effectiveness. This creates the need for a systematic framework to design distance functions which are sensitive to the particular characteristics of the data domain. In this paper, we discuss such a framework. The goal is to create distance functions in an automated way while minimizing the work required from the user. We will show that this framework creates distance functions which are significantly more effective than popularly used functions such as the Euclidean metric.

