|
ABSTRACT
In this work a novel distance-based outlier detection algorithm, named DOLPHIN, working on disk-resident datasets and whose I/O cost corresponds to the cost of sequentially reading the input dataset file twice, is presented. It is both theoretically and empirically shown that the main memory usage of DOLPHIN amounts to a small fraction of the dataset and that DOLPHIN has linear time performance with respect to the dataset size. DOLPHIN gains efficiency by naturally merging together in a unified schema three strategies, namely the selection policy of objects to be maintained in main memory, usage of pruning rules, and similarity search techniques. Importantly, similarity search is accomplished by the algorithm without the need of preliminarily indexing the whole dataset, as other methods do. The algorithm is simple to implement and it can be used with any type of data, belonging to either metric or nonmetric spaces. Moreover, a modification to the basic method allows DOLPHIN to deal with the scenario in which the available buffer of main memory is smaller than its standard requirements. DOLPHIN has been compared with state-of-the-art distance-based outlier detection algorithms, showing that it is much more efficient.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
Arning, A., Aggarwal, C., and Raghavan, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'96), 164--169.
|
| |
6
|
Barnett, V. and Lewis, T. 1994. Outliers in Statistical Data. John Wiley & Sons.
|
 |
7
|
|
 |
8
|
Norbert Beckmann , Hans-Peter Kriegel , Ralf Schneider , Bernhard Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.322-331, May 23-26, 1990, Atlantic City, New Jersey, United States
|
 |
9
|
|
| |
10
|
|
 |
11
|
|
 |
12
|
Markus M. Breunig , Hans-Peter Kriegel , Raymond T. Ng , Jörg Sander, LOF: identifying density-based local outliers, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.93-104, May 15-18, 2000, Dallas, Texas, United States
|
 |
13
|
|
| |
14
|
|
| |
15
|
Davies, L. and Gather, U. 1989. The identification of multiple outliers. Tech. rep. 89/1, Department of Statistics, University of Dortmund.
|
| |
16
|
Davies, L. and Gather, U. 1993. The identification of multiple outliers. J. Amer. Statist. Assoc. 88, 782--792.
|
| |
17
|
Eskin, E., Arnold, A., Prerau, M., Portnoy, L., and Stolfo, S. 2002. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer.
|
| |
18
|
Ghoting, A., Parthasarathy, S., and Otey, M. 2006. Fast mining of distance-based outliers in high-dimensional datasets. In Proceedings of the SIAM International Conference on Data Mining (SDM'06).
|
| |
19
|
Hawkins, D. 1980. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall.
|
 |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
Lazarevic, A., Ertöz, L., Kumar, V., Ozgur, A., and Srivastava, J. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM International Conference on Data Mining.
|
| |
26
|
Papadimitriou, S., Kitagawa, H., Gibbons, P., and Faloutsos, C. 2003. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the International Conference on Data Enginnering (ICDE), 315--326.
|
 |
27
|
Sridhar Ramaswamy , Rajeev Rastogi , Kyuseok Shim, Efficient algorithms for mining outliers from large data sets, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.427-438, May 15-18, 2000, Dallas, Texas, United States
|
| |
28
|
Rider, P. 1962. The negative binomial distribution and the incomplete beta function. The Amer. Math. Monthly 69, 4, 302--304.
|
| |
29
|
|
| |
30
|
|
| |
31
|
Schultze, V. and Pawlitschko, J. 2002. The identification of outliers in exponential samples. Statistica Neerlandica 56, 1, 41--57.
|
 |
32
|
|
| |
33
|
Uhlmann, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4, 175--179.
|
| |
34
|
Watanabe, O. 2000. Simple sampling techniques for discovery science. TIEICE: IEICE Trans. Commun./Electron./Inf. Syst.
|
|