|
ABSTRACT
Decision makers of companies often face the dilemma of whether to release data for knowledge discovery, vis-a-vis the risk of disclosing proprietary or sensitive information. Among the various methods employed for “sanitizing” the data prior to disclosure, we focus in this article on anonymization, given its widespread use in practice. We do due diligence to the question “just how safe is the anonymized data?” We consider both those scenarios when the hacker has no information and, more realistically, when the hacker may have partial information about items in the domain. We conduct our analyses in the context of frequent set mining and address the safety question at two different levels: (i) how likely of being cracked (i.e., re-identified by a hacker), are the identities of individual items and (ii) how likely are sets of items cracked? For capturing the prior knowledge of the hacker, we propose a belief function, which amounts to an educated guess of the frequency of each item. For various classes of belief functions which correspond to different degrees of prior knowledge, we derive formulas for computing the expected number of cracks of single items and for itemsets, the probability of cracking the itemsets. While obtaining, exact values for more general situations is computationally hard, we propose a series of heuristics called the O-estimates. They are easy to compute and are shown fairly accurate, justified by empirical results on real benchmark datasets. Based on the O-estimates, we propose a recipe for the decision makers to resolve their dilemma. Our recipe operates at two different levels, depending on whether the data owner wants to reason in terms of single items or sets of items (or both). Finally, we present techniques for ascertaining a hacker's knowledge of correlation in terms of co-occurrence of items likely. This information regarding the hacker's knowledge can be incorporated into our framework of disclosure risk analysis and we present experimental results demonstrating how this knowledge affects the heuristic estimates we have developed.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
 |
3
|
|
 |
4
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
5
|
Aggarwal, C. C. and Yu, P. S. 2004. A condensation approach to privacy preserving data mining. In Proceedings of the International Conference on Extending Database Technology (EDBT), 183--199.
|
| |
6
|
Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., and Zhu, A. 2005. Anonymizing tables. In Proceedings of the International Conference on Database Theory (ICDT), 246--258.
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
| |
13
|
Fienberg, S. E., Makov, U. E., and Steele, R. J. 1998. Disclosure limitation using perturbation and related methods for categorical data. J. Office Statist. 14, 385--397.
|
| |
14
|
FIMI. 2003. The FIMI repository. http://fimi.cs.helsinki.fi/fimi03/http://fimi.cs.helsinki.fi/fimi03/.
|
 |
15
|
|
| |
16
|
Hettich, S. and Bay, S. D. 1999. The UCI KDD archive. University of California, Irvine, CA. http://kdd.ics.uci.edu.
|
 |
17
|
|
| |
18
|
Jerrum, M. and Vazirani, U. 1996. A mildly exponential approximation algorithm for the permanent. Algorithmica 16, 4-5, 392--401.
|
 |
19
|
|
| |
20
|
|
| |
21
|
|
 |
22
|
|
 |
23
|
|
| |
24
|
Moore, R. A. 1996. Controlled data-swapping techniques for masking public use micro-data sets. In Statistical Research Division Report Series, RR 96-04, U.S. Bureau of the Census, Washington, DC.
|
 |
25
|
|
 |
26
|
Raymond T. Ng , Laks V. S. Lakshmanan , Jiawei Han , Alex Pang, Exploratory mining and pruning optimizations of constrained associations rules, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.13-24, June 01-04, 1998, Seattle, Washington, United States
|
| |
27
|
Ramesh, G. 2005. Can attackers learn from samples? In Proceedings of the Very Large Databases (VLDB) Workshop on Secure Data Management (SDM), 104--123.
|
| |
28
|
Rasmussen, L. E. 1994. Approximating the permanent: A simple approach. Random Structures Algor. 5, 2, 349--361.
|
| |
29
|
Samarati, P. and Sweeny, L. 1998. Protecting privacy when disclosing information: k-Anonymity and its enforcement through generalization and suppression. In the IEEE Symposium on Research in Security and Privacy.
|
| |
30
|
|
| |
31
|
Valiant, L. G. 1979. The complexity of computing the permanent. Theoret. Comput. Sci. 8, 189--201.
|
| |
32
|
|
| |
33
|
Winkler, W. E. 1999. The state of record linkage and current research problems. Res. Rep., U.S. Census Bureau.
|
| |
34
|
Winkler, W. E. 1994. Advanced methods for record linkage. Res. Rep., U.S. Census Bureau.
|
| |
35
|
Winkler, W. E. 1993. Matching and record linkage. Res. Rep., U.S. Census Bureau.
|
| |
36
|
|
INDEX TERMS
Primary Classification:
E.
Data
E.m
MISCELLANEOUS
General Terms:
Algorithms,
Security
Keywords:
Disclosure risk,
anonymization,
belief function,
bipartite graphs,
correlation,
frequent itemsets,
hacker,
matching,
prior knowledge,
sampling
|