CoCo: coding cost for parameter-free outlier detection
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Research track papers
Pages 149-158
Year of Publication: 2009
ISBN: 978-1-60558-495-9
ABSTRACT
How can we automatically spot all outstanding observations in a data set? This question arises in a large variety of applications, e.g. in economics, biology and medicine. Existing approaches to outlier detection suffer from one or more of the following drawbacks: The results of many methods strongly depend on suitable parameter settings, such as the minimum cluster size or the number of desired outliers, which are very difficult to estimate without background knowledge of the data. Many methods implicitly assume Gaussian or uniformly distributed data, and/or their results are difficult to interpret. To cope with these problems, we propose CoCo, a technique for parameter-free outlier detection. The basic idea of our technique relates outlier detection to data compression: Outliers are objects which cannot be effectively compressed given the data set. To avoid the assumption of a certain data distribution, CoCo relies on a very general data model combining the Exponential Power Distribution with Independent Components. We define an intuitive outlier factor based on the principle of the Minimum Description Length, together with a novel algorithm for outlier detection. An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our technique. Availability: The source code of CoCo and the data sets used in the experiments are available at: http://www.dbs.ifi.lmu.de/Forschung/KDD/Boehm/CoCo.
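The compression view in the abstract can be illustrated with a small one-dimensional sketch. This is not the CoCo algorithm: it omits the Independent Components step and the paper's outlier factor, neither of which is specified on this page. It only shows the general idea that a point with low probability under a fitted exponential power (generalized normal) density needs more bits to encode. The function names, the fixed shape parameter `beta`, the use of the median as location estimate, and the toy data are all assumptions made for illustration.

```python
import math
from statistics import median


def epd_log2_density(x, mu, alpha, beta):
    """log2 of the exponential power density
    f(x) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x - mu| / alpha) ** beta).
    beta = 2 gives a Gaussian, beta = 1 a Laplace distribution."""
    log_norm = math.log(beta) - math.log(2.0 * alpha) - math.lgamma(1.0 / beta)
    log_pdf = log_norm - (abs(x - mu) / alpha) ** beta
    return log_pdf / math.log(2.0)


def fit_scale(values, mu, beta):
    """Maximum-likelihood scale alpha for a fixed location mu and shape beta."""
    total = sum(abs(v - mu) ** beta for v in values)
    return (beta * total / len(values)) ** (1.0 / beta)


def coding_costs(values, beta=2.0):
    """Per-point cost in bits, -log2 f(x), under the fitted density.
    For continuous data this is relative to the measurement resolution,
    so individual costs may be negative; only the ranking matters here."""
    mu = median(values)  # assumed location estimate, chosen for robustness
    alpha = fit_scale(values, mu, beta)
    return [-epd_log2_density(v, mu, alpha, beta) for v in values]


# Hypothetical toy data: six similar readings and one deviating value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
costs = coding_costs(data)
worst = max(range(len(data)), key=costs.__getitem__)
print("most expensive point to encode:", data[worst])
```

In this sketch the shape `beta` is fixed by hand, whereas a parameter-free method would have to select it from the data, for example by choosing the value that minimizes the total description length.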