|
ABSTRACT
Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
| |
3
|
V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.
|
 |
4
|
|
| |
5
|
|
| |
6
|
G. Bisson. Learning in FOL with a similarity measure. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 82--87, 1992.
|
| |
7
|
R. J. Bolton and D. J. Hand. Statistical fraud detection: A review (with discussion). Statistical Science, 17(3):235--255, 2002.
|
 |
8
|
Markus M. Breunig , Hans-Peter Kriegel , Raymond T. Ng , Jörg Sander, LOF: identifying density-based local outliers, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.93-104, May 15-18, 2000, Dallas, Texas, United States
|
| |
9
|
W. Emde and D. Wettschereck. Relational instance-based learning. In Proceedings of the thirteenth International Conference on Machine Learning, 1996.
|
| |
10
|
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Data Mining for Security Applications, 2002.
|
| |
11
|
E. Fix and J. L. Hodges. Discriminatory analysis: Nonparametric discrimination: Small sample performance. Technical Report Project 21-49-004, Report Number 11, USAF School of Aviation Medicine, Randolf Field, Texas, 1952.
|
 |
12
|
|
| |
13
|
D. Hawkins. Identification of outliers. Chapman and Hall, 1980.
|
| |
14
|
S. Hettich and S. D. Bay. The UCI KDD archive. {http://kdd.ics.uci.edu/}. Irvine, CA: University of California, Department of Information and Computer Science, 1999.
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
 |
18
|
|
 |
19
|
Sridhar Ramaswamy , Rajeev Rastogi , Kyuseok Shim, Efficient algorithms for mining outliers from large data sets, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.427-438, May 15-18, 2000, Dallas, Texas, United States
|
| |
20
|
S. Ruggles and M. Sobek. Integrated public use microdata series: Version 2.0. {http://www.ipums.umn.edu/}, 1997.
|
| |
21
|
Rulequest Research. Gritbot. {http://www.rulequest.com/}.
|
CITED BY 33
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Amol Ghoting , Gregory Buehrer , Srinivasan Parthasarathy , Daehyun Kim , Anthony Nguyen , Yen-Kuang Chen , Pradeep Dubey, A characterization of data mining algorithms on a modern processor, Proceedings of the 1st international workshop on Data management on new hardware, June 12-12, 2005, Baltimore, Maryland
|
|
|
|
|
|
|
|
|
Bo Sheng , Qun Li , Weizhen Mao , Wen Jin, Outlier detection in sensor networks, Proceedings of the 8th ACM international symposium on Mobile ad hoc networking and computing, September 09-14, 2007, Montreal, Quebec, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xinjie Lu , Tian Yang , Zaifei Liao , Manzoor Elahi , Wei Liu , Hongan Wang, Incremental outlier detection in data streams using local correlation integral, Proceedings of the 2009 ACM symposium on Applied Computing, March 08-12, 2009, Honolulu, Hawaii
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|