Scalable robust covariance and correlation estimates for data mining
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
SESSION: Statistical methods I
Pages: 14 - 23
Year of Publication: 2002
ISBN: 1-58113-567-X
ABSTRACT
Covariance and correlation estimates have important applications in data mining. In the presence of outliers, classical estimates of covariance and correlation matrices are not reliable. A small fraction of outliers, in some cases even a single outlier, can distort the classical covariance and correlation estimates, making them virtually useless. That is, the correlations reported for the vast majority of the data can be badly wrong; principal components transformations can be misleading; and multidimensional outlier detection via Mahalanobis distances can fail to detect outliers. There is an extensive statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine-equivariant estimators that possess high breakdown points and small worst-case biases. All such estimators have unacceptable exponential complexity in the number of variables and quadratic complexity in the number of observations. In this paper we focus on several variants of robust covariance and correlation matrix estimates with quadratic complexity in the number of variables and linear complexity in the number of observations. These estimators are based on several forms of pairwise robust covariance and correlation estimates. The estimators studied include two fast estimators based on coordinate-wise robust transformations embedded in an overall procedure recently proposed by [14]. We show that the estimators have attractive robustness properties, and give an example that uses one of the estimators in the new Insightful Miner data mining product.
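The abstract does not say which pairwise estimators the paper uses, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes the Gnanadesikan-Kettenring identity (see [5] and [3]) with the median absolute deviation as the robust scale; the function names `mad_scale`, `gk_correlation`, and `pairwise_robust_correlation` are hypothetical. The matrix is filled one pair of variables at a time, so the work grows with the number of pairs, p(p-1)/2, and each pair needs only a few passes over the n observations.

```python
import numpy as np


def mad_scale(x):
    """Median absolute deviation, scaled to match the standard deviation
    for normally distributed data (assumed choice of robust scale)."""
    med = np.median(x)
    return 1.4826 * np.median(np.abs(x - med))


def gk_correlation(x, y, scale=mad_scale):
    """Pairwise robust correlation from robust scales of the sum and
    difference of the robustly standardized variables.
    Assumes both variables have a nonzero robust scale."""
    u = (x - np.median(x)) / scale(x)
    v = (y - np.median(y)) / scale(y)
    s_plus = scale(u + v) ** 2
    s_minus = scale(u - v) ** 2
    # This normalized form always lies in [-1, 1].
    return (s_plus - s_minus) / (s_plus + s_minus)


def pairwise_robust_correlation(data):
    """Fill a p x p correlation matrix one pair of columns at a time."""
    p = data.shape[1]
    corr = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            corr[j, k] = corr[k, j] = gk_correlation(data[:, j], data[:, k])
    return corr


# Synthetic data with true correlation 0.8, plus one gross outlier.
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = 0.8 * x + 0.6 * rng.standard_normal(n)
x[0], y[0] = 1000.0, -1000.0
data = np.column_stack([x, y])

classical = np.corrcoef(data, rowvar=False)[0, 1]
robust = pairwise_robust_correlation(data)[0, 1]
print("classical:", classical)
print("robust:   ", robust)
```

With this synthetic data, the single outlier is enough to flip the sign of the classical correlation, while the pairwise robust estimate stays close to the value shared by the bulk of the data. A matrix assembled this way is not guaranteed to be positive semidefinite, which is one reason a pairwise estimate is typically embedded in a larger procedure.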
REFERENCES
[1] M. B. Abdullah. On a Robust Correlation Coefficient. In The Statistician, 39, pp. 455--460, 1990.
[2] P. Davies. Asymptotic Behavior of S-Estimates of Multivariate Location Parameters and Dispersion Matrices. In The Annals of Statistics, 15, pp. 1269--1292, 1987.
[3] S. J. Devlin, R. Gnanadesikan and J. R. Kettenring. Robust Estimation of Dispersion Matrices and Principal Components. In Journal of the American Statistical Association, 76, pp. 354--362, 1981.
[4] D. L. Donoho. Breakdown Properties of Multivariate Location Estimators. Ph.D. Qualifying Paper, Dept. of Statistics, Harvard University, 1982.
[5] R. Gnanadesikan and J. R. Kettenring. Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. In Biometrics, 28, pp. 81--124, 1972.
[6] F. Hampel, E. Ronchetti, P. Rousseeuw and W. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986.
[7] P. J. Huber. Robust Statistics. John Wiley & Sons, 1981.
[9] A. Marazzi and C. Ruffieux. Implementing M-Estimators of the Gamma Distribution. In Robust Statistics, Data Analysis, and Computer Intensive Methods: In Honor of Peter J. Huber's 60th Birthday, Springer Verlag, 1996.
[10] R. Maronna. Personal Communication. In International Conference on Robust Statistics, 2002.
[11] R. Maronna. Robust M-Estimators of Multivariate Location and Scatter. In The Annals of Statistics, 4, pp. 51--67, 1976.
[13] R. Maronna and V. Yohai. The Behaviour of the Stahel-Donoho Robust Multivariate Estimator. In Journal of the American Statistical Association, 90 (429), pp. 330--341, 1995.
[14] R. Maronna and R. Zamar. Robust Estimates of Location and Dispersion for High Dimensional Data Sets. In Technometrics, to appear, 2002.
[15] D. M. Rocke and D. L. Woodruff. Identification of Outliers in Multivariate Data. In Journal of the American Statistical Association, 91 (435), pp. 1047--1061, 1996.
[16] P. Rousseeuw. Least Median of Squares Regression. In Journal of the American Statistical Association, 79, pp. 871--880, 1984.
[17] P. Rousseeuw. Multivariate Estimation with High Breakdown Point. In Mathematical Statistics and Applications, pp. 283--297, Reidel Publishing, 1985.
[20] W. A. Stahel. Breakdown of Covariance Estimators. Research Report 31, Fachgruppe für Statistik, ETH, Zurich, 1981.