A general model for clustering binary data
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Chicago, Illinois, USA
Session: Research track paper
Pages: 188 - 197  
Year of Publication: 2005
ISBN:1-59593-135-X
Author
Tao Li  Florida International University, Miami, FL
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1081870.1081894

ABSTRACT

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data, which arises in market basket datasets, where transactions contain items, and in document datasets, where each document is represented as a "bag of words". The contribution of the paper is threefold. First, a general binary data clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. We characterize several variations of the general model with different optimization procedures. Second, we establish the connections between our clustering model and other existing clustering methods. Third, we discuss the problem of determining the number of clusters for binary clustering. Experimental results show the effectiveness of the proposed clustering model.
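
To make the idea of explicit data assignments and feature assignments concrete, the following is a minimal sketch and not the paper's model or optimization procedure. It assumes a simple alternating scheme on a binary matrix W (rows are transactions, columns are items): a feature is assigned to a cluster when more than half of that cluster's rows contain it, and each row is assigned to the cluster whose feature pattern is closest in Hamming distance. The function name cluster_binary, the round-robin initialization, and the toy data are all hypothetical choices made for illustration.

```python
def cluster_binary(W, k, max_iter=20):
    """Alternate between feature assignments and data assignments.

    W is a list of equal-length 0/1 rows. Returns (assign, feats) where
    assign[i] is the cluster of row i and feats[c][j] is 1 if feature j
    is associated with cluster c. Illustrative sketch only.
    """
    n, m = len(W), len(W[0])
    # Assumed deterministic initialization: round-robin over clusters.
    assign = [i % k for i in range(n)]
    feats = [[0] * m for _ in range(k)]
    for _ in range(max_iter):
        # Feature step: feature j joins cluster c when a strict majority
        # of the rows currently in c contain it (empty clusters get none).
        for c in range(k):
            members = [W[i] for i in range(n) if assign[i] == c]
            feats[c] = [
                1 if 2 * sum(row[j] for row in members) > len(members) else 0
                for j in range(m)
            ]
        # Data step: each row moves to the cluster whose feature pattern
        # has the smallest Hamming distance (ties go to the lowest index).
        new_assign = [
            min(range(k),
                key=lambda c: sum(x != y for x, y in zip(W[i], feats[c])))
            for i in range(n)
        ]
        if new_assign == assign:
            break  # assignments are stable, so feats matches assign
        assign = new_assign
    return assign, feats


# Hypothetical market-basket data: 6 transactions over 4 items.
W = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
assign, feats = cluster_binary(W, 2)
print(assign)
print(feats)
```

On this toy input the first three transactions end up in one cluster and the last three in the other, with the first two items attached to the first cluster and the last two items to the second. The result depends on the assumed initialization; a practical implementation would typically try several starting points.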

