Hierarchical model-based clustering of large datasets through fractionation and refractionation
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
SESSION: Statistical methods II
Pages: 183 - 190
Year of Publication: 2002
ISBN: 1-58113-567-X
ABSTRACT
The goal of clustering is to identify distinct groups in a dataset. Compared to non-parametric clustering methods like complete linkage, hierarchical model-based clustering has the advantage of offering a way to estimate the number of groups present in the data. However, its computational cost is quadratic in the number of items to be clustered, and it is therefore not applicable to large problems. We review an idea called Fractionation, originally conceived by Cutting, Karger, Pedersen and Tukey for non-parametric hierarchical clustering of large datasets, and describe an adaptation of Fractionation to model-based clustering. A further extension, called Refractionation, leads to a procedure that can be successful even in the difficult situation where there are large numbers of small groups.
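The Fractionation idea reviewed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the model-based merge criterion is replaced by a plain centroid-distance merge, and the names `fractionate`, `agglomerate`, `fraction_size`, and `alpha`, the random assignment of items to fractions, and the rule that no fraction is reduced below `k` clusters are all assumptions made for illustration. What it shows is that the expensive agglomerative step only ever runs on at most `fraction_size` items at a time.

```python
import random


def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def agglomerate(items, k):
    """Naive agglomerative clustering of `items` down to at most k clusters.

    Each item is a (centroid, members) pair. The two clusters with the
    closest centroids are merged until k remain. This stands in for the
    model-based merge step (an assumption of this sketch) and is at least
    quadratic in len(items), so it is only applied to small fractions.
    """
    clusters = [(c, list(m)) for c, m in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum((a - b) ** 2
                        for a, b in zip(clusters[i][0], clusters[j][0]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        members = clusters[i][1] + clusters[j][1]
        clusters[i] = (centroid(members), members)
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def fractionate(points, fraction_size, alpha, k, seed=0):
    """Hypothetical Fractionation driver.

    Splits the current items into fractions of at most `fraction_size`,
    reduces each fraction to about alpha * (its size) clusters, treats the
    resulting clusters as new items, and repeats until everything fits in
    one fraction. A final pass reduces the survivors to k clusters.
    """
    assert 0 < alpha < 1 and k < fraction_size
    rng = random.Random(seed)
    items = [(p, [p]) for p in points]
    rng.shuffle(items)  # assumption: fractions are formed at random
    while len(items) > fraction_size:
        reduced = []
        for start in range(0, len(items), fraction_size):
            fraction = items[start:start + fraction_size]
            # Assumption: never reduce a fraction below k clusters.
            target = max(k, int(alpha * len(fraction)))
            reduced.extend(agglomerate(fraction, target))
        items = reduced
    return agglomerate(items, k)
```

A call such as `fractionate(points, fraction_size=40, alpha=0.25, k=3)` returns a list of `(centroid, members)` pairs. Because each pass shrinks the number of items by roughly the factor `alpha`, the total work in this sketch grows roughly linearly in the number of points for a fixed `fraction_size`, rather than quadratically.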
REFERENCES
1. J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study final report, 1998.
2. M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 49--60, Philadelphia, Pennsylvania, 1999.
3. J. D. Banfield and A. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803--821, 1993.
4.
5. P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large datasets. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD98), 1998.
6. P. Bradley, U. Fayyad, and C. Reina. Scaling EM (expectation-maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research, 1999.
7. D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318--329, Copenhagen, Denmark, 1992. doi: 10.1145/133160.133214
8. P. Domingos and G. Hulten. Learning from infinite data in finite time. In Advances in Neural Information Processing Systems 14, 2002.
9. S. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments & Computers, 23(2):229--236, 1991.
10. M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226--231, 1996.
11. E. B. Fowlkes and C. L. Mallows. A method for comparing two hierarchical clusterings. J. American Statistical Association, 78:553--569, 1983.
12. C. Fraley and A. Raftery. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41:578--588, 1998.
13. J. Hartigan. Statistical theory in clustering. Journal of Classification, 2:63--76, 1985.
14. G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8(1):65--74, 1997.
15. G. McLachlan and D. Peel. Finite Mixture Models. John Wiley & Sons, 2000.
16. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461--464, 1978.
17. D. Scott. Multivariate Density Estimation. Wiley, 1992.
18. D. Wishart. Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In A. Cole, editor, Numerical Taxonomy, pages 282--311. Academic Press, 1969.