Workload-aware anonymization techniques for large-scale datasets
Source
ACM Transactions on Database Systems (TODS)
Volume 33, Issue 3 (August 2008)
Article No. 17  
Year of Publication: 2008
ISSN:0362-5915
Authors
Kristen LeFevre  University of Michigan
David J. DeWitt  Microsoft
Raghu Ramakrishnan  Yahoo! Research
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1386118.1386123

ABSTRACT

Protecting individual privacy is an important problem in microdata distribution and publishing. Anonymization algorithms typically aim to satisfy certain privacy definitions with minimal impact on the quality of the resulting data. While much of the previous literature has measured quality through simple one-size-fits-all measures, we argue that quality is best judged with respect to the workload for which the data will ultimately be used.

This article provides a suite of anonymization algorithms that incorporate a target class of workloads, consisting of one or more data mining tasks as well as selection predicates. An extensive empirical evaluation indicates that this approach is often more effective than previous techniques. In addition, we consider the problem of scalability. The article describes two extensions that allow us to scale the anonymization algorithms to datasets much larger than main memory. The first extension is based on ideas from scalable decision trees, and the second is based on sampling. A thorough performance evaluation indicates that these techniques are viable in practice.
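
The abstract does not name a privacy definition or a specific algorithm, so the following is only an illustrative sketch of what "workload-aware" could mean in practice. It assumes k-anonymity as the privacy definition and a classification task as the workload, and it uses a greedy recursive partitioning that picks each split to minimize the entropy of the class label. The function names (`entropy`, `best_split`, `anonymize`), the sample attributes, and the data are all hypothetical and are not taken from the article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(rows, attrs, label, k):
    """Find the (score, attr, threshold) split with the lowest weighted
    label entropy such that both sides keep at least k rows.
    Returns None when no allowable split exists."""
    best = None
    for a in attrs:
        values = sorted({r[a] for r in rows})
        for t in values[:-1]:  # candidate split: value <= t vs. value > t
            left = [r for r in rows if r[a] <= t]
            right = [r for r in rows if r[a] > t]
            if len(left) < k or len(right) < k:
                continue  # assumed privacy constraint: k-anonymity
            score = (len(left) * entropy([r[label] for r in left])
                     + len(right) * entropy([r[label] for r in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best


def anonymize(rows, attrs, label, k):
    """Recursively partition rows; when no split is allowed, replace each
    quasi-identifier value with the (min, max) range of its partition."""
    split = best_split(rows, attrs, label, k)
    if split is None:
        ranges = {a: (min(r[a] for r in rows), max(r[a] for r in rows))
                  for a in attrs}
        return [{**ranges, label: r[label]} for r in rows]
    _, a, t = split
    left = [r for r in rows if r[a] <= t]
    right = [r for r in rows if r[a] > t]
    return anonymize(left, attrs, label, k) + anonymize(right, attrs, label, k)


# Hypothetical microdata: two quasi-identifiers and one class label.
sample = [
    {"age": 23, "zip": 53703, "label": "low"},
    {"age": 25, "zip": 53703, "label": "low"},
    {"age": 31, "zip": 53706, "label": "low"},
    {"age": 38, "zip": 53706, "label": "high"},
    {"age": 44, "zip": 53711, "label": "high"},
    {"age": 52, "zip": 53711, "label": "high"},
]
released = anonymize(sample, ["age", "zip"], "label", k=2)
for row in released:
    print(row)
```

In this sketch the workload enters only through the split criterion: a workload-agnostic variant would choose splits by partition size or attribute range alone, whereas here the split that best separates the class labels is preferred among those the privacy constraint allows.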



REVIEW

Reviewer: Aris Gkoulalas-Divanis

The release of microdata to third parties raises important questions regarding the privacy of the individuals whose information is recorded in the dataset. To meet these privacy concerns, many anonymization algorithms for microdata have been propo...
