ACM Home Page
Please provide us with feedback. Feedback
Achieving anonymity via clustering
Full text PdfPdf (229 KB)
Source Symposium on Principles of Database Systems archive
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
Chicago, IL, USA
SESSION: Popularity and privacy table of contents
Pages: 153 - 162  
Year of Publication: 2006
ISBN:1-59593-318-2
Authors
Gagan Aggarwal  Google Inc., Mountain View, CA
Tomás Feder  Stanford University, Stanford, CA
Krishnaram Kenthapadi  Stanford University, Stanford, CA
Samir Khuller  University of Maryland, College Park, MD
Rina Panigrahy  Stanford University, Stanford, CA
Dilys Thomas  Stanford University, Stanford, CA
An Zhu  Google Inc., Mountain View, CA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGART: ACM Special Interest Group on Artificial Intelligence
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 26,   Downloads (12 Months): 113,   Citation Count: 19
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1142351.1142374
What is a DOI?

ABSTRACT

Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of de-identifying records is to remove identifying fields such as social security number, name etc. However, recent research has shown that a large fraction of the US population can be identified using non-key attributes (called quasi-identifiers) such as date of birth, gender, and zip code [15]. Sweeney [16] proposed the k-anonymity model for privacy where non-key attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k−1 other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of data records. This technique is more general since we have a much larger choice for cluster centers than k-Anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant-factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Approximation Algorithms for k-Anonymity. Journal of Privacy Technology, Paper number: 20051120001, 2005.
 
2
 
3
 
4
 
5
S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee. Toward Privacy in Public Databases. In Proceedings of the Theory of Cryptography Conference, pages 363--385, 2005.
 
6
 
7
 
8
D. Hochbaum and D. Shmoys. A best possible approximation algorithm for the k-center problem. Mathematics of Operations Research, 10(2), pages 180--184, 1985.
 
9
 
10
 
11
12
 
13
14
 
15
L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population. LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000.
 
16
 
17
Time. The Death of Privacy, August 1997.

CITED BY  20

Collaborative Colleagues:
Gagan Aggarwal: colleagues
Tomás Feder: colleagues
Krishnaram Kenthapadi: colleagues
Samir Khuller: colleagues
Rina Panigrahy: colleagues
Dilys Thomas: colleagues
An Zhu: colleagues