ABSTRACT
Assume you are given a data population characterized by a certain number of attributes. Assume, moreover, that you are told that one of the individuals in this population is abnormal, but are given no reason why this particular individual is to be considered abnormal. In many cases, you will indeed be interested in discovering such reasons. This article is concerned with precisely this problem: discovering sets of attributes that account for the (a priori stated) abnormality of an individual within a given dataset. A criterion is presented to measure the abnormality of combinations of attribute values featured by the given abnormal individual with respect to the reference population. In this respect, each subset of attributes is intended to represent a “property” of individuals. We distinguish between global and local properties. Global properties are subsets of attributes explaining the given abnormality with respect to the entire data population. Local properties instead single out two subsets of attributes: the first justifies the abnormality within the data subpopulation selected using the values that the exceptional individual takes on the attributes in the second. The problem of identifying abnormal properties with associated explanations is formally stated and analyzed. This formal characterization is then exploited to devise efficient algorithms for detecting both global and local forms of most abnormal properties. The experimental evidence reported in the article shows that the algorithms are able both to mine meaningful information and to accomplish the computational task by examining a negligible fraction of the search space.
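The distinction between global and local properties can be illustrated with a small sketch. The abstract does not state the article's abnormality criterion, so the code below assumes a simple stand-in: the relative frequency with which the population shares the individual's values on a subset of attributes (lower means more abnormal). The function names, the toy data, and the brute-force enumeration are all hypothetical; in particular, this sketch makes no attempt at the search-space pruning the article reports.

```python
from itertools import combinations

def frequency(rows, individual, attrs):
    """Fraction of rows that agree with `individual` on every attribute in `attrs`.
    Assumed stand-in for an abnormality measure, not the article's criterion."""
    if not rows:
        return 0.0
    matches = sum(all(r[a] == individual[a] for a in attrs) for r in rows)
    return matches / len(rows)

def global_properties(rows, individual, max_size=2):
    """Score every attribute subset up to `max_size` against the whole population.
    Returns (score, property) pairs, rarest first."""
    attrs = sorted(individual)
    scored = []
    for k in range(1, max_size + 1):
        for subset in combinations(attrs, k):
            scored.append((frequency(rows, individual, subset), subset))
    return sorted(scored)

def local_properties(rows, individual):
    """Score (explanation, property) pairs, restricted to single attributes.
    The explanation selects the subpopulation sharing the individual's value;
    the property is then scored inside that subpopulation only."""
    attrs = sorted(individual)
    scored = []
    for e in attrs:
        subpopulation = [r for r in rows if r[e] == individual[e]]
        for p in attrs:
            if p != e:
                score = frequency(subpopulation, individual, (p,))
                scored.append((score, (e,), (p,)))
    return sorted(scored)

# Toy population (hypothetical); the last row is the individual stated to be abnormal.
rows = [
    {"age": "old", "smoker": "no", "bp": "normal"},
    {"age": "old", "smoker": "no", "bp": "normal"},
    {"age": "old", "smoker": "no", "bp": "normal"},
    {"age": "young", "smoker": "yes", "bp": "normal"},
    {"age": "young", "smoker": "yes", "bp": "normal"},
    {"age": "young", "smoker": "no", "bp": "normal"},
    {"age": "old", "smoker": "yes", "bp": "normal"},
]
individual = rows[-1]

best_global = global_properties(rows, individual)[0]
best_local = local_properties(rows, individual)[0]
```

On this toy data, the rarest global property is the pair of attributes `age` and `smoker` taken together, while the rarest local pair uses `age` as the explanation and `smoker` as the property: being a smoker is common overall but rare among the old individuals.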