ACM Home Page
Please provide us with feedback. Feedback
Approximation algorithms for clustering uncertain data
Full text PdfPdf (235 KB)
Source
Symposium on Principles of Database Systems archive
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
Vancouver, Canada
SESSION: Searching & clustering table of contents
Pages 191-200  
Year of Publication: 2008
ISBN:978-1-60558-152-1
Authors
Graham Cormode  AT&T Labs, Florham Park, NJ, USA
Andrew McGregor  UC San Diego, La Jolla, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 31,   Downloads (12 Months): 353,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376916.1376944
What is a DOI?

ABSTRACT

There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and as output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such data in a DBMS, it is necessary to perform other data analysis tasks such as data mining. We study the core mining problem of clustering on uncertain data, and define appropriate natural generalizations of standard clustering optimization criteria. Two variations arise, depending on whether a point is automatically associated with its optimal center, or whether it must be assigned to a fixed cluster no matter where it is actually located.

For uncertain versions of k-means and k-median, we show reductions to their corresponding weighted versions on data with no uncertainties. These are simple in the unassigned case, but require some care for the assigned version. Our most interesting results are for uncertain k-center, which generalizes both traditional k-center and k-median objectives. We show a variety of bicriteria approximation algorithms. One picks O(kε--1log2n) centers and achieves a (1 + ε) approximation to the best uncertain k-centers. Another picks 2k centers and achieves a constant factor approximation. Collectively, these results are the first known guaranteed approximation algorithms for the problems of clustering uncertain data.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
C. Aggarwal and P. S. Yu. Framework for clustering uncertain data streams. In IEEE International Conference on Data Engineering, 2008.
 
2
 
3
4
 
5
 
6
 
7
M. Chau, R. Cheng, B. Kao, and J. Ngai. Uncertain data mining: An example in clustering location data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2006.
8
 
9
 
10
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1--38, 1977.
 
11
J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3:32--57, 1973.
 
12
M. Dyer and A. Frieze. A simple heuristic for the p-center problem. Operations Research Letters, 3:285--288, 1985.
 
13
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, page 226, 1996.
14
 
15
T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38(2-3):293--306, 1985.
16
 
17
S. Har-Peled. Geometric approximation algorithms. http://valis.cs.uiuc.edu/~sariel/teach/notes/aprx/book.pdf, 2007.
18
 
19
D. Hochbaum and D. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180--184, May 1985.
20
21
22
 
23
 
24
 
25
J. B. MacQueen. Some method for the classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Structures, pages 281--297, 1967.
 
26
 
27
28


Collaborative Colleagues:
Graham Cormode: colleagues
Andrew McGregor: colleagues