Detecting outliers using transduction and statistical testing
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Philadelphia, PA, USA
SESSION: Research track papers
Pages: 55-64
Year of Publication: 2006
ISBN: 1-59593-339-5
Authors
Daniel Barbará, George Mason University, Fairfax, VA
Carlotta Domeniconi, George Mason University, Fairfax, VA
James P. Rogers, U.S. Army Engineer Research and Development Center, Alexandria, VA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1150402.1150413

ABSTRACT

Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work in outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based) and use ad-hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model. The test can also be used to recognize outliers when the clustering information is not available. Our method is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to decide whether a point fits in each of the clusters of the model. We experimentally demonstrate that the test is highly robust and produces very few misdiagnosed points, even when no clustering information is available. Furthermore, our experiments demonstrate the robustness of our method when the data are contaminated by outliers. Finally, we show that our technique can be successfully applied to identify outliers in a noisy data set for which no information (e.g., ground truth or clustering structure) is available. As such, our proposed methodology is capable of bootstrapping a clean data set from a noisy one, and the clean set can then be used to identify future outliers.
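
The per-cluster hypothesis test described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the test is computed, so everything below is an assumption made for illustration: the strangeness measure (sum of distances to the k nearest neighbours), the function names `knn_strangeness`, `transductive_p_value`, and `is_outlier`, and the default values of `k` and `significance`. It follows the general transductive recipe of ranking a candidate point's strangeness against the strangeness of the cluster members, and should not be read as the authors' implementation.

```python
import math


def knn_strangeness(point, others, k):
    """Assumed strangeness measure: sum of distances to the k nearest neighbours."""
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k])


def transductive_p_value(point, cluster, k=3):
    """p-value for the null hypothesis that `point` belongs to `cluster`.

    The point is tentatively added to the cluster, every member of the
    augmented set gets a strangeness score, and the p-value is the fraction
    of members that are at least as strange as the candidate point.
    """
    augmented = list(cluster) + [point]
    scores = []
    for i, x in enumerate(augmented):
        rest = augmented[:i] + augmented[i + 1:]
        scores.append(knn_strangeness(x, rest, k))
    candidate_score = scores[-1]
    at_least_as_strange = sum(1 for s in scores if s >= candidate_score)
    return at_least_as_strange / len(augmented)


def is_outlier(point, clusters, significance=0.05, k=3):
    """Flag `point` only if membership is rejected for every cluster."""
    return all(
        transductive_p_value(point, c, k) < significance for c in clusters
    )


if __name__ == "__main__":
    # Two hypothetical clusters laid out on small grids.
    cluster_a = [(float(x), float(y)) for x in range(6) for y in range(5)]
    cluster_b = [(x + 50.0, y + 50.0) for x, y in cluster_a]
    for candidate in [(2.5, 2.5), (100.0, 100.0)]:
        print(candidate, is_outlier(candidate, [cluster_a, cluster_b]))
```

One consequence of this design is that the smallest attainable p-value for a cluster of n points is 1/(n + 1), so a cluster needs at least 1/significance members before any point can be rejected at that level. The significance level plays the role that an ad-hoc distance or density threshold plays in other methods, but it has a direct statistical interpretation.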


