|
ABSTRACT
Decision makers of companies often face the dilemma of whether to release data for knowledge discovery, vis a vis the risk of disclosing proprietary or sensitive information. While there are various "sanitization" methods, in this paper we focus on anonymization, given its widespread use in practice. We give due diligence to the question of "just how safe the anonymized data is", in terms of protecting the true identities of the data objects. We consider both the scenarios when the hacker has no information, and more realistically, when the hacker may have partial information about items in the domain. We conduct our analyses in the context of frequent set mining. We propose to capture the prior knowledge of the hacker by means of a belief function, where an educated guess of the frequency of each item is assumed. For various classes of belief functions, which correspond to different degrees of prior knowledge, we derive formulas for computing the expected number of "cracks". While obtaining the exact values for the more general situations is computationally hard, we propose a heuristic called the O-estimate. It is easy to compute, and is shown to be accurate empirically with real benchmark datasets. Finally, based on the O-estimates, we propose a recipe for the decision makers to resolve their dilemma.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Aggarwal C. C and Yu P. S. A Condensation Approach to Privacy Preserving Data Mining. EDBT, 2004.
|
| |
3
|
Aggarwal. G et. al. Anonymizing Tables. ICDT Conference, 2005.
|
 |
4
|
|
 |
5
|
|
 |
6
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
7
|
|
 |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
Fienberg S. E. et. al. Disclosure Limitation using perturbation and related methods for Categorical Data. Journal of Office Statistics, 14, 1998.
|
 |
12
|
|
 |
13
|
|
| |
14
|
M. Jerrum and U. Vazirani. A mildly exponential approximation algorithm for the permanent. Algorithmica 16, 1996.
|
| |
15
|
|
| |
16
|
|
| |
17
|
Moore Jr. R. A. Controlled Data-Swapping Techniques for Masking Public Use Microdata Sets. Statistical Research Division Report Series, RR 96-04, US Bureau of Census, Washington D. C., 1996.
|
 |
18
|
|
 |
19
|
|
 |
20
|
|
| |
21
|
Rasmussen L. E. Approximating the permanent: A simple approach. Random Structures and Algorithms, 5, 1994.
|
| |
22
|
Samarati P. and Sweeney L. Protecting Privacy when Disclosing Information: k-anonymity and its Enforcement through Generalization and Suppression. IEEE Symposium on Research in Security and Privacy, 1998.
|
| |
23
|
|
 |
24
|
|
| |
25
|
Valiant L. G. The Complexity of Computing the Permanent. Theoretical Computer Science, 8, 1979.
|
| |
26
|
|
| |
27
|
Yang X. and Li C. Secure XML Publishing without Information Leakage in the Presence of Data Inference. VLDB Conference, 2004.
|
|