ACM Home Page
Please provide us with feedback. Feedback
DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets
Full text PdfPdf (703 KB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 3 ,  Issue 1  (March 2009) table of contents
Article No. 4  
Year of Publication: 2009
ISSN:1556-4681
Authors
Fabrizio Angiulli  DEIS, Università della Calabria, Rende(CS), Italy
Fabio Fassetti  DEIS, Università della Calabria, Rende(CS), Italy
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 30,   Downloads (12 Months): 257,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1497577.1497581
What is a DOI?

ABSTRACT

In this work a novel distance-based outlier detection algorithm, named DOLPHIN, working on disk-resident datasets and whose I/O cost corresponds to the cost of sequentially reading the input dataset file twice, is presented.

It is both theoretically and empirically shown that the main memory usage of DOLPHIN amounts to a small fraction of the dataset and that DOLPHIN has linear time performance with respect to the dataset size. DOLPHIN gains efficiency by naturally merging together in a unified schema three strategies, namely the selection policy of objects to be maintained in main memory, usage of pruning rules, and similarity search techniques. Importantly, similarity search is accomplished by the algorithm without the need of preliminarily indexing the whole dataset, as other methods do.

The algorithm is simple to implement and it can be used with any type of data, belonging to either metric or nonmetric spaces. Moreover, a modification to the basic method allows DOLPHIN to deal with the scenario in which the available buffer of main memory is smaller than its standard requirements. DOLPHIN has been compared with state-of-the-art distance-based outlier detection algorithms, showing that it is much more efficient.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
 
4
 
5
Arning, A., Aggarwal, C., and Raghavan, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'96), 164--169.
 
6
Barnett, V. and Lewis, T. 1994. Outliers in Statistical Data. John Wiley & Sons.
7
8
9
 
10
11
12
13
 
14
 
15
Davies, L. and Gather, U. 1989. The identification of multiple outliers. Tech. rep. 89/1, Department of Statistics, University of Dortmund.
 
16
Davies, L. and Gather, U. 1993. The identification of multiple outliers. J. Amer. Statist. Assoc. 88, 782--792.
 
17
Eskin, E., Arnold, A., Prerau, M., Portnoy, L., and Stolfo, S. 2002. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer.
 
18
Ghoting, A., Parthasarathy, S., and Otey, M. 2006. Fast mining of distance-based outliers in high-dimensional datasets. In Proceedings of the SIAM International Conference on Data Mining (SDM'06).
 
19
Hawkins, D. 1980. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall.
20
 
21
 
22
 
23
 
24
 
25
Lazarevic, A., Ertöz, L., Kumar, V., Ozgur, A., and Srivastava, J. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM International Conference on Data Mining.
 
26
Papadimitriou, S., Kitagawa, H., Gibbons, P., and Faloutsos, C. 2003. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the International Conference on Data Enginnering (ICDE), 315--326.
27
 
28
Rider, P. 1962. The negative binomial distribution and the incomplete beta function. The Amer. Math. Monthly 69, 4, 302--304.
 
29
 
30
 
31
Schultze, V. and Pawlitschko, J. 2002. The identification of outliers in exponential samples. Statistica Neerlandica 56, 1, 41--57.
32
 
33
Uhlmann, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4, 175--179.
 
34
Watanabe, O. 2000. Simple sampling techniques for discovery science. TIEICE: IEICE Trans. Commun./Electron./Inf. Syst.

Collaborative Colleagues:
Fabrizio Angiulli: colleagues
Fabio Fassetti: colleagues