ACM Home Page
Please provide us with feedback. Feedback
Robust space transformations for distance-based operations
Full text PdfPdf (905 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
San Francisco, California
Pages: 126 - 135  
Year of Publication: 2001
ISBN:1-58113-391-X
Authors
Edwin M. Knorr  Univ. of British Columbia, Vancouver, BC Canada
Raymond T. Ng  Univ. of British Columbia, Vancouver, BC Canada
Ruben H. Zamar  Univ. of British Columbia, Vancouver, BC Canada
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
AAAI : American Association for Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 45,   Citation Count: 6
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/502512.502532
What is a DOI?

ABSTRACT

For many KDD operations, such as nearest neighbor search, distance-based clustering, and outlier detection, there is an underlying &kgr;-D data space in which each tuple/object is represented as a point in the space. In the presence of differing scales, variability, correlation, and/or outliers, we may get unintuitive results if an inappropriate space is used.The fundamental question that this paper addresses is: "What then is an appropriate space?" We propose using a robust space transformation called the Donoho-Stahel estimator. In the first half of the paper, we show the key properties of the estimator. Of particular importance to KDD applications involving databases is the stability property, which says that in spite of frequent updates, the estimator does not: (a) change much, (b) lose its usefulness, or (c) require re-computation. In the second half, we focus on the computation of the estimator for high-dimensional databases. We develop randomized algorithms and evaluate how well they perform empirically. The novel algorithm we develop called the Hybrid-random algorithm is, in most cases, at least an order of magnitude faster than the Fixed-angle and Subsampling algorithms.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
H. Anton and C. Rorres. Elementary Linear Algebra: Applications Version. John Wiley & Sons, 1994.
 
3
V. Barnett and T. Lewis. Outliers in Statistical Data. 3rd edition. John Wiley & Sons, 1994.
4
5
6
7
 
8
G.E.P. Box and D.R. Cox. An Analysis of Transformations (with Discussion). In Journal of the Royal Statistical Society, 26, Series B (Methodological), pp. 211-252, 1964.
 
9
R. Burden and J. Faires. Numerical Analysis. PWS Publishing, 1993.
 
10
C. Croux and A. Ruiz-Gazen. A fast algorithm for robust principal components based on projection pursuit. In Prat, A., editor, Compstat: Proceedings in Computational Statistics, Heidelberg: Physica-Verlag, pp. 211-216.
 
11
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. 1996 KDD, pp. 226-231.
 
12
13
 
14
A. Hinneburg and D. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 KDD, pp. 58-65.
 
15
I. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
 
16
 
17
G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. In Journal of the American Statistical Association, 80, pp. 759-766.
 
18
R. Martin and R. Zamar. Bias robust estimation of scale. In The Annals of Statistics, 21, 2, pp. 991-1017, 1993.
 
19
R. Maronna and V. Yohai. The behaviour of the Stahel-Donoho robust multivariate estimator. In Journal of the American Statistical Association, 90 (429), pp. 330-341, 1995.
 
20
21
 
22
 
23
 
24
 
25
W. A. Stahel. Breakdown of Covariance Estimators. Research report, 31, Fachgruppe fur Statistik, ETH, Zurich, 1981.
 
26
 
27
V. Yohai and R. Zamar. High breakdown point estimates of regression by means of the minimization of an efficient scale. In Journal of the American Statistical Association, 83 (402), pp. 406-413, 1988.


Collaborative Colleagues:
Edwin M. Knorr: colleagues
Raymond T. Ng: colleagues
Ruben H. Zamar: colleagues