ACM Home Page
Please provide us with feedback. Feedback
Mining distance-based outliers in near linear time with randomization and a simple pruning rule
Full text PdfPdf (221 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Washington, D.C.
SESSION: Research track table of contents
Pages: 29 - 38  
Year of Publication: 2003
ISBN:1-58113-737-0
Authors
Stephen D. Bay  Institute for the Study of Learning and Expertise, Palo Alto, CA
Mark Schwabacher  NASA Ames Research Center, Moffet Field, CA
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 22,   Downloads (12 Months): 118,   Citation Count: 33
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/956750.956758
What is a DOI?

ABSTRACT

Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
 
3
V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.
4
 
5
 
6
G. Bisson. Learning in FOL with a similarity measure. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 82--87, 1992.
 
7
R. J. Bolton and D. J. Hand. Statistical fraud detection: A review (with discussion). Statistical Science, 17(3):235--255, 2002.
8
 
9
W. Emde and D. Wettschereck. Relational instance-based learning. In Proceedings of the thirteenth International Conference on Machine Learning, 1996.
 
10
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Data Mining for Security Applications, 2002.
 
11
E. Fix and J. L. Hodges. Discriminatory analysis: Nonparametric discrimination: Small sample performance. Technical Report Project 21-49-004, Report Number 11, USAF School of Aviation Medicine, Randolf Field, Texas, 1952.
12
 
13
D. Hawkins. Identification of outliers. Chapman and Hall, 1980.
 
14
S. Hettich and S. D. Bay. The UCI KDD archive. {http://kdd.ics.uci.edu/}. Irvine, CA: University of California, Department of Information and Computer Science, 1999.
 
15
 
16
 
17
18
19
 
20
S. Ruggles and M. Sobek. Integrated public use microdata series: Version 2.0. {http://www.ipums.umn.edu/}, 1997.
 
21
Rulequest Research. Gritbot. {http://www.rulequest.com/}.

CITED BY  33

Collaborative Colleagues:
Stephen D. Bay: colleagues
Mark Schwabacher: colleagues