ABSTRACT
Clustering is a fundamental data mining technique. This article presents an improved EM (Expectation-Maximization) algorithm for clustering large data sets that exhibit high dimensionality, noise, and zero-variance problems. The algorithm incorporates several improvements that increase both solution quality and speed. In general, it finds a good clustering solution in three scans over the data set; alternatively, it can be run until it converges. It has a few parameters that are easy to set and that have default values suitable for most cases. The proposed algorithm is compared against the standard EM algorithm and the On-Line EM algorithm.
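The abstract does not spell out the improved algorithm, so the following is only a minimal sketch of the standard EM baseline it is compared against, written for a mixture of Gaussians with diagonal covariances. Everything named here is an assumption made for illustration: the function `em_diag_gmm`, the parameters `max_scans`, `tol`, `var_floor`, and `seed`, the random initialization, and in particular the variance floor, which is one common way to guard against zero-variance dimensions and is not claimed to be the authors' technique. The default `max_scans=3` simply mirrors the three scans mentioned above; passing `tol` makes the loop stop when the log-likelihood gain falls below it.

```python
import math
import random


def em_diag_gmm(data, k, max_scans=3, tol=None, var_floor=1e-6, seed=0):
    """Standard EM for a k-component Gaussian mixture with diagonal covariances.

    Each iteration is one scan over `data`. Stops after `max_scans` scans, or
    earlier if `tol` is given and the log-likelihood gain drops below it.
    Returns (weights, means, variances, log-likelihood history).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])

    # Assumed initialization: k random points as means, global variance per dimension.
    means = [list(x) for x in rng.sample(data, k)]
    gmean = [sum(x[j] for x in data) / n for j in range(d)]
    gvar = [max(sum((x[j] - gmean[j]) ** 2 for x in data) / n, var_floor)
            for j in range(d)]
    variances = [list(gvar) for _ in range(k)]
    weights = [1.0 / k] * k

    history, prev_ll = [], None
    for _ in range(max_scans):
        # Sufficient statistics accumulated during the scan.
        nk = [0.0] * k
        s1 = [[0.0] * d for _ in range(k)]
        s2 = [[0.0] * d for _ in range(k)]
        ll = 0.0

        for x in data:
            # E step: log of weight times density, per component.
            logp = []
            for c in range(k):
                lp = math.log(weights[c])
                for j in range(d):
                    v = variances[c][j]
                    lp -= 0.5 * (math.log(2 * math.pi * v)
                                 + (x[j] - means[c][j]) ** 2 / v)
                logp.append(lp)
            # Log-sum-exp keeps the normalization stable in high dimensions.
            m = max(logp)
            norm = m + math.log(sum(math.exp(lp - m) for lp in logp))
            ll += norm
            for c in range(k):
                r = math.exp(logp[c] - norm)  # responsibility of component c
                nk[c] += r
                for j in range(d):
                    s1[c][j] += r * x[j]
                    s2[c][j] += r * x[j] * x[j]

        # M step: re-estimate parameters from the accumulated statistics.
        for c in range(k):
            if nk[c] <= 0.0:
                continue  # leave an empty component unchanged
            weights[c] = nk[c] / n
            for j in range(d):
                mu = s1[c][j] / nk[c]
                means[c][j] = mu
                # Assumed safeguard: floor the variance so constant dimensions
                # never produce a zero (or slightly negative) variance.
                variances[c][j] = max(s2[c][j] / nk[c] - mu * mu, var_floor)

        history.append(ll)  # log-likelihood of the parameters used in this scan
        if tol is not None and prev_ll is not None and ll - prev_ll < tol:
            break
        prev_ll = ll

    return weights, means, variances, history
```

Calling `em_diag_gmm(points, k=2)` performs exactly three scans, while `em_diag_gmm(points, k=2, max_scans=100, tol=1e-6)` approximates running until convergence. The sketch keeps only per-cluster sums in memory during a scan, which is what makes a fixed number of passes over a large data set practical.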