ABSTRACT
Outlier detection can uncover malicious behavior in fields such as intrusion detection and fraud analysis. Although there has been a significant amount of work on outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based) and use ad-hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model; the same test can also be used to recognize outliers when no clustering information is available. Our method is based on Transductive Confidence Machines, which have previously been proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to accept or reject the hypothesis that a point belongs to each of the clusters of the model. We demonstrate experimentally that the test is highly robust and produces very few misdiagnosed points, even when no clustering information is available. Our experiments further show that the method remains robust when the data are contaminated by outliers. Finally, we show that our technique can be successfully applied to identify outliers in a noisy data set for which no information is available (e.g., ground truth or clustering structure). As such, our proposed methodology is capable of bootstrapping a clean data set from a noisy one, and the clean set can then be used to identify future outliers.
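To make the idea of a per-cluster hypothesis test concrete, the following is a minimal sketch and not the authors' implementation. It assumes a strangeness measure defined as the sum of distances to the k nearest neighbours within a cluster, and a p-value defined as the fraction of points in the cluster (including the candidate) that are at least as strange as the candidate. The function names, the choice of k, and the 0.05 significance level are all illustrative assumptions.

```python
import math


def knn_strangeness(point, others, k):
    """Assumed strangeness: sum of distances from `point` to its k nearest
    neighbours among `others`. Larger values mean a stranger point."""
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k])


def cluster_p_value(x, cluster, k=3):
    """Transductive p-value for the hypothesis 'x belongs to `cluster`'.

    The candidate x is temporarily added to the cluster, every point's
    strangeness is recomputed against the rest of the augmented set, and
    the p-value is the fraction of points at least as strange as x.
    """
    augmented = list(cluster) + [x]
    scores = []
    for i, p in enumerate(augmented):
        rest = augmented[:i] + augmented[i + 1:]
        scores.append(knn_strangeness(p, rest, k))
    x_score = scores[-1]
    return sum(1 for s in scores if s >= x_score) / len(augmented)


def is_outlier(x, clusters, k=3, significance=0.05):
    """Flag x as an outlier if the membership hypothesis is rejected
    for every cluster at the (assumed) significance level."""
    return all(cluster_p_value(x, c, k) < significance for c in clusters)
```

Under these assumptions, a cluster needs at least 20 points before any candidate can be rejected at the 0.05 level, because the smallest attainable p-value is 1/(n+1) for a cluster of n points.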