|
ABSTRACT
For many KDD operations, such as nearest neighbor search, distance-based clustering, and outlier detection, there is an underlying &kgr;-D data space in which each tuple/object is represented as a point in the space. In the presence of differing scales, variability, correlation, and/or outliers, we may get unintuitive results if an inappropriate space is used.The fundamental question that this paper addresses is: "What then is an appropriate space?" We propose using a robust space transformation called the Donoho-Stahel estimator. In the first half of the paper, we show the key properties of the estimator. Of particular importance to KDD applications involving databases is the stability property, which says that in spite of frequent updates, the estimator does not: (a) change much, (b) lose its usefulness, or (c) require re-computation. In the second half, we focus on the computation of the estimator for high-dimensional databases. We develop randomized algorithms and evaluate how well they perform empirically. The novel algorithm we develop called the Hybrid-random algorithm is, in most cases, at least an order of magnitude faster than the Fixed-angle and Subsampling algorithms.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States
|
| |
2
|
H. Anton and C. Rorres. Elementary Linear Algebra: Applications Version. John Wiley & Sons, 1994.
|
| |
3
|
V. Barnett and T. Lewis. Outliers in Statistical Data. 3rd edition. John Wiley & Sons, 1994.
|
 |
4
|
Norbert Beckmann , Hans-Peter Kriegel , Ralf Schneider , Bernhard Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.322-331, May 23-26, 1990, Atlantic City, New Jersey, United States
|
 |
5
|
Kristin P. Bennett , Usama Fayyad , Dan Geiger, Density-based indexing for approximate nearest-neighbor queries, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.233-243, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312236]
|
 |
6
|
|
 |
7
|
Markus M. Breunig , Hans-Peter Kriegel , Raymond T. Ng , Jörg Sander, LOF: identifying density-based local outliers, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.93-104, May 15-18, 2000, Dallas, Texas, United States
|
| |
8
|
G.E.P. Box and D.R. Cox. An Analysis of Transformations (with Discussion). In Journal of the Royal Statistical Society, 26, Series B (Methodological), pp. 211-252, 1964.
|
| |
9
|
R. Burden and J. Faires. Numerical Analysis. PWS Publishing, 1993.
|
| |
10
|
C. Croux and A. Ruiz-Gazen. A fast algorithm for robust principal components based on projection pursuit. In Prat, A., editor, Compstat: Proceedings in Computational Statistics, Heidelberg: Physica-Verlag, pp. 211-216.
|
| |
11
|
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. 1996 KDD, pp. 226-231.
|
| |
12
|
C. Faloutsos , R. Barber , M. Flickner , J. Hafner , W. Niblack , D. Petkovic , W. Equitz, Efficient and effective querying by image content, Journal of Intelligent Information Systems, v.3 n.3-4, p.231-262, July 1994
[doi> 10.1007/BF00962238]
|
 |
13
|
|
| |
14
|
A. Hinneburg and D. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 KDD, pp. 58-65.
|
| |
15
|
I. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
|
| |
16
|
|
| |
17
|
G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. In Journal of the American Statistical Association, 80, pp. 759-766.
|
| |
18
|
R. Martin and R. Zamar. Bias robust estimation of scale. In The Annals of Statistics, 21, 2, pp. 991-1017, 1993.
|
| |
19
|
R. Maronna and V. Yohai. The behaviour of the Stahel-Donoho robust multivariate estimator. In Journal of the American Statistical Association, 90 (429), pp. 330-341, 1995.
|
| |
20
|
|
 |
21
|
Sridhar Ramaswamy , Rajeev Rastogi , Kyuseok Shim, Efficient algorithms for mining outliers from large data sets, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.427-438, May 15-18, 2000, Dallas, Texas, United States
|
| |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
W. A. Stahel. Breakdown of Covariance Estimators. Research report, 31, Fachgruppe fur Statistik, ETH, Zurich, 1981.
|
| |
26
|
|
| |
27
|
V. Yohai and R. Zamar. High breakdown point estimates of regression by means of the minimization of an efficient scale. In Journal of the American Statistical Association, 83 (402), pp. 406-413, 1988.
|
|