|
ABSTRACT
In this paper we develop an efficient implementation for a k-means clustering algorithm. Our algorithm is a variant of KMHybrid [28, 20], i.e. it uses a combination of Lloyd-steps and random swaps, but as a novel feature it uses coresets to speed up the algorithm. A coreset is a small weighted set of points that approximates the original point set with respect to the considered problem. The main strength of the algorithm is that it can quickly determine clusterings of the same point set for many values of k. This is necessary in many applications, since, typically, one does not know a good value for k in advance. Once we have clusterings for many different values of k we can determine a good choice of k using a quality measure of clusterings that is independent of k, for example the average silhouette coefficient. The average silhouette coefficient can be approximated using coresets.To evaluate the performance of our algorithm we compare it with algorithm KMHybrid [28] on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. We show that our algorithm significantly outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm deviates less than that of KMHybrid.We also computed clusterings and approximate average silhouette coefficient for k=1,…,100 for our input instances and discuss the performance of our algorithm in detail.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric Approximation via Coresets. Survey available at http://valis.cs.uiuc.edu/sariel/research/papers/04/survey/survey.pdf
|
| |
3
|
K. Alsabti, S. Ranka, and V. Singh. An Efficient k-Means Clustering Algorithm. Proceedings of the first Workshop on High Performance Data Mining, 1998.
|
| |
4
|
|
 |
5
|
|
| |
6
|
P. Berkhin. Survey of Clustering Data Mining Techniques. Available at ..., 2002.
|
 |
7
|
|
| |
8
|
A.Czumaj and C. Sohler. Sublinear-Time Approximation for Clustering via Random Sampling. Proceedings of the 31st Annual International Colloquium on Autonate, Languages and Programming (ICALP'04), pp. 396--407, 2004.
|
 |
9
|
W. Fernandez de la Vega , Marek Karpinski , Claire Kenyon , Yuval Rabani, Approximation schemes for clustering problems, Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, June 09-11, 2003, San Diego, CA, USA
[doi> 10.1145/780542.780550]
|
| |
10
|
E. Forgey. Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification. Biometrics, 21:768, 1965.
|
 |
11
|
|
| |
12
|
|
| |
13
|
|
 |
14
|
|
 |
15
|
|
 |
16
|
|
 |
17
|
Mary Inaba , Naoki Katoh , Hiroshi Imai, Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract), Proceedings of the tenth annual symposium on Computational geometry, p.332-339, June 06-08, 1994, Stony Brook, New York, United States
[doi> 10.1145/177424.178042]
|
| |
18
|
|
| |
19
|
Tapas Kanungo , David M. Mount , Nathan S. Netanyahu , Christine D. Piatko , Ruth Silverman , Angela Y. Wu, An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.24 n.7, p.881-892, July 2002
[doi> 10.1109/TPAMI.2002.1017616]
|
 |
20
|
Tapas Kanungo , David M. Mount , Nathan S. Netanyahu , Christine D. Piatko , Ruth Silverman , Angela Y. Wu, A local search approximation algorithm for k-means clustering, Proceedings of the eighteenth annual symposium on Computational geometry, p.10-18, June 05-07, 2002, Barcelona, Spain
[doi> 10.1145/513400.513402]
|
| |
21
|
D. Knuth. The Art of Computer Programming: Sorting and Searching, Vol. 3, Addison-Wesley, 1973.
|
| |
22
|
|
| |
23
|
S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28: 129--137, 1982.
|
| |
24
|
J. Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1): 61--84, 2000.
|
| |
25
|
J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pp. 281--296, 1967.
|
| |
26
|
|
| |
27
|
Nina Mishra , Dan Oblinger , Leonard Pitt, Sublinear time approximate clustering, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.439-447, January 07-09, 2001, Washington, D.C., United States
|
| |
28
|
D. Mount. KMlocal: A Testbed for k-means Clustering Algorithms. Available at http://www.cs.umd.edu/ mount/Projects/KMeans/km-localdoc.pdf
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
| |
32
|
S. Selim and M. Ismail. k-Means-Type Algorithms: A Generalized Convergence Theorem and Characterizations of Local Optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6::81--87, 1984.
|
| |
33
|
|
|