|
ABSTRACT
There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and as output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such data in a DBMS, it is necessary to perform other data analysis tasks such as data mining. We study the core mining problem of clustering on uncertain data, and define appropriate natural generalizations of standard clustering optimization criteria. Two variations arise, depending on whether a point is automatically associated with its optimal center, or whether it must be assigned to a fixed cluster no matter where it is actually located. For uncertain versions of k-means and k-median, we show reductions to their corresponding weighted versions on data with no uncertainties. These are simple in the unassigned case, but require some care for the assigned version. Our most interesting results are for uncertain k-center, which generalizes both traditional k-center and k-median objectives. We show a variety of bicriteria approximation algorithms. One picks O(kε--1log2n) centers and achieves a (1 + ε) approximation to the best uncertain k-centers. Another picks 2k centers and achieves a constant factor approximation. Collectively, these results are the first known guaranteed approximation algorithms for the problems of clustering uncertain data.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
C. Aggarwal and P. S. Yu. Framework for clustering uncertain data streams. In IEEE International Conference on Data Engineering, 2008.
|
| |
2
|
|
| |
3
|
Vijay Arya , Naveen Garg , Rohit Khandekar , Adam Meyerson , Kamesh Munagala , Vinayaka Pandit, Local Search Heuristics for k-Median and Facility Location Problems, SIAM Journal on Computing, v.33 n.3, p.544-562, 2004
[doi> 10.1137/S0097539702416402]
|
 |
4
|
|
| |
5
|
|
| |
6
|
Moses Charikar , Samir Khuller , David M. Mount , Giri Narasimhan, Algorithms for facility location problems with outliers, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.642-651, January 07-09, 2001, Washington, D.C., United States
|
| |
7
|
M. Chau, R. Cheng, B. Kao, and J. Ngai. Uncertain data mining: An example in clustering location data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2006.
|
 |
8
|
|
| |
9
|
|
| |
10
|
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1--38, 1977.
|
| |
11
|
J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3:32--57, 1973.
|
| |
12
|
M. Dyer and A. Frieze. A simple heuristic for the p-center problem. Operations Research Letters, 3:285--288, 1985.
|
| |
13
|
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, page 226, 1996.
|
 |
14
|
|
| |
15
|
T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38(2-3):293--306, 1985.
|
 |
16
|
Sudipto Guha , Rajeev Rastogi , Kyuseok Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.73-84, June 01-04, 1998, Seattle, Washington, United States
|
| |
17
|
S. Har-Peled. Geometric approximation algorithms. http://valis.cs.uiuc.edu/~sariel/teach/notes/aprx/book.pdf, 2007.
|
 |
18
|
|
| |
19
|
D. Hochbaum and D. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180--184, May 1985.
|
 |
20
|
Mary Inaba , Naoki Katoh , Hiroshi Imai, Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract), Proceedings of the tenth annual symposium on Computational geometry, p.332-339, June 06-08, 1994, Stony Brook, New York, United States
[doi> 10.1145/177424.178042]
|
 |
21
|
|
 |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
J. B. MacQueen. Some method for the classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Structures, pages 281--297, 1967.
|
| |
26
|
Wang Kay Ngai , Ben Kao , Chun Kit Chui , Reynold Cheng , Michael Chau , Kevin Y. Yip, Efficient Clustering of Uncertain Data, Proceedings of the Sixth International Conference on Data Mining, p.436-445, December 18-22, 2006
[doi> 10.1109/ICDM.2006.63]
|
| |
27
|
|
 |
28
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
|