Hierarchical model-based clustering of large datasets through fractionation and refractionation
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
Session: Statistical methods II
Pages: 183-190
Year of Publication: 2002
ISBN: 1-58113-567-X
Authors
Jeremy Tantrum  University of Washington, Seattle, WA
Alejandro Murua  Insightful Corporation, Seattle, WA
Werner Stuetzle  University of Washington, Seattle, WA
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
AAAI
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775047.775074

ABSTRACT

The goal of clustering is to identify distinct groups in a dataset. Compared to non-parametric clustering methods like complete linkage, hierarchical model-based clustering has the advantage of offering a way to estimate the number of groups present in the data. However, its computational cost is quadratic in the number of items to be clustered, and it is therefore not applicable to large problems. We review an idea called Fractionation, originally conceived by Cutting, Karger, Pedersen and Tukey for non-parametric hierarchical clustering of large datasets, and describe an adaptation of Fractionation to model-based clustering. A further extension, called Refractionation, leads to a procedure that can be successful even in the difficult situation where there are large numbers of small groups.
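
The abstract does not spell out the procedure, so the following is only a minimal sketch of the general Fractionation idea as it is usually described: split the data into fixed-size fractions, cluster each fraction down to a small share of its size, treat the resulting clusters as new items, and repeat until a single ordinary clustering pass can finish the job. It is not the authors' implementation. It substitutes plain centroid-linkage agglomeration for the model-based merge criterion, keeps fractions in input order, and omits Refractionation entirely. The names `fractionate`, `agglomerate`, `fraction_size`, and `alpha` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerate(items, k):
    """Merge (centroid, weight) items pairwise until at most k remain.

    Assumption: centroid linkage stands in for the model-based merge
    criterion of the paper. The pair search is a naive full scan, which is
    acceptable only because it runs on small fractions.
    """
    clusters = list(items)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ci, wi), (cj, wj) = clusters[i], clusters[j]
        w = wi + wj
        # Weighted mean keeps the centroid equal to the mean of all members.
        merged = (tuple((wi * a + wj * b) / w for a, b in zip(ci, cj)), w)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


def fractionate(points, fraction_size, alpha, k):
    """Reduce the data fraction by fraction, then cluster what is left into k groups."""
    if fraction_size < 2 or not 0 < alpha < 1 or k < 1:
        raise ValueError("need fraction_size >= 2, 0 < alpha < 1, k >= 1")
    items = [(tuple(p), 1) for p in points]
    while len(items) > fraction_size:
        reduced = []
        for start in range(0, len(items), fraction_size):
            fraction = items[start:start + fraction_size]
            target = max(1, int(alpha * len(fraction)))
            reduced.extend(agglomerate(fraction, target))
        items = reduced  # clusters of this pass are the items of the next
    return agglomerate(items, k)


# Synthetic data: three well-separated groups of 30 points each (illustrative only).
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [
    (cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
    for cx, cy in centers
    for _ in range(30)
]
rng.shuffle(points)

result = fractionate(points, fraction_size=30, alpha=0.2, k=3)
for centroid, weight in sorted(result):
    print(centroid, weight)
```

Every pass in this sketch works only within fractions of bounded size, so no step compares all pairs of items in the full dataset. That is the property that lets the approach sidestep the quadratic cost mentioned above. In the sketch, `alpha` controls how aggressively each fraction is reduced per pass.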



