|
ABSTRACT
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market basket datasets where the transactions contain items and for document datasets where the documents contain "bag of words". The contribution of the paper is three-fold. First a general binary data clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as feature assignments. We characterize several variations with different optimization procedures for the general model. Second, we also establish the connections between our clustering model with other existing clustering methods. Third, we also discuss the problem for determining the number of clusters for binary clustering. Experimental results show the effectiveness of the proposed clustering model.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Charu C. Aggarwal , Joel L. Wolf , Philip S. Yu , Cecilia Procopiuc , Jong Soo Park, Fast algorithms for projected clustering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.61-72, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
2
|
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States
|
| |
3
|
Baier, D., Gaul, W., & Schader, M. (1997). Two-mode overlapping clustering with applications to simultaneous benefit segmentation and market structuring. In R. Klar and O. Opitz (Eds.), Classification and knowledge organization, 577--566. Springer.
|
| |
4
|
Baulieu, F. B. (1997). Two variant axiom systems for presence/absence based dissimilarity coefficients. Journal of Classification, 14, 159--170.
|
| |
5
|
Daniel Boley , Maria Gini , Robert Gross , Eui-Hong (Sam) Han , Kyle Hastings , George Karypis , Vipin Kumar , Bamshad Mobasher , Jerome Moore, Document Categorization and Query Generation on the World Wide WebUsing WebACE, Artificial Intelligence Review, v.13 n.5-6, p.365-391, Dec. 1999
[doi> 10.1023/A:1006592405320]
|
| |
6
|
Castillo, W., & Trejos, J. (2002). Two-mode partitioning: Review of methods and application and tabu search. In K. Jajuga, A. Sokolowski and H.-H. Bock (Eds.), Classification, clustering and data analysis, 43--51. Springer.
|
| |
7
|
Cho, H., Dhillon, I. S., Guan, Y., & Sra, S. (2004). Minimum sum-squared residue co-clustering of gene experssion data. Proceedings of the SIAM Data Mining Conference.
|
| |
8
|
Desarbo, W. (1982). GENNCLUS: New models for general nonhierarchical clustering analysis. Psuchometrika, 47, 449--475.
|
| |
9
|
Deuflhard, P., Huisinga, W., Fischer, A., & Schutte, C. (2000). Identification of almost invariant aggregates in reversible nearly coupled markov chain. Linear Algebra and Its Applications, 315, 39--59.
|
 |
10
|
|
| |
11
|
|
| |
12
|
|
| |
13
|
|
| |
14
|
Golub, G. H., & Loan, C. F. V. (1991). Matrix computations. The Johns Hopkins University Press.
|
| |
15
|
Govaert, G. (1995). Simultaneous clustering of rows and columns. Control and Cybernetics, 24, 437--458.
|
 |
16
|
Eui-Hong Han , Daniel Boley , Maria Gini , Robert Gross , Kyle Hastings , George Karypis , Vipin Kumar , Bamshad Mobasher , Jerome Moore, WebACE: a Web agent for document categorization and exploration, Proceedings of the second international conference on Autonomous agents, p.408-415, May 10-13, 1998, Minneapolis, Minnesota, United States
[doi> 10.1145/280765.280872]
|
| |
17
|
|
| |
18
|
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, prediction. Springer.
|
| |
19
|
|
| |
20
|
Kato, T. (1995). Perturbation theory for linear operators. Springer.
|
 |
21
|
|
| |
22
|
Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. NIPS (pp. 556--562).
|
| |
23
|
Li, T., & Ma, S. (2004). IFD:iterative feature and data clustering. Proceedings of the 2004 SIAM International conference on Data Mining (SDM 2004). SIAM.
|
 |
24
|
Tao Li , Sheng Ma , Mitsunori Ogihara, Entropy-based criterion in categorical clustering, Proceedings of the twenty-first international conference on Machine learning, p.68, July 04-08, 2004, Banff, Alberta, Canada
[doi> 10.1145/1015330.1015404]
|
| |
25
|
Li, T., & Zhu, S. (2005). On clustering binary data. Proceedings of the 2005 SIAM International Conference On Data Mining(SDM'05) (pp. 526--530).
|
| |
26
|
Maris, E., Boeck, P. D., & Mechelen, I. V. (1996). Probability matrix decomposition models. Psychometrika, 61, 7--29.
|
| |
27
|
Maurizio, V. (2001). Double k-means clustering for simultaneous classification of objects and variables. In S. Borra, R. Rocci, M. Vichi and M. Schader (Eds.), Advances in classification and data analysis, 43--52. Springer.
|
| |
28
|
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
|
| |
29
|
Mickey, M. R., Mundle, P., & Engelman, L. (1988). Boolean factor analysis. In Bmdp statistical software manual, vol. 2, 789--800. University of California Press.
|
| |
30
|
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5, 111--126.
|
| |
31
|
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465--471.
|
| |
32
|
Sha, F., Saul, L. K., & Lee, D. D. (2002). Multiplicative updates for nonegative quadratic programming in support vector machines. Advances in Neural Information Processing Systems (pp. 1065--1072).
|
| |
33
|
Shepard, R. N., & Arabie, P. (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86, 87--123.
|
 |
34
|
|
| |
35
|
Soete, G., & Carroll, J. D. (1994). K-means clustering in a low-dimensional eucildean space. New Approaches in Classification and Data Analysis (pp. 212--219). Springer-Verlag.
|
| |
36
|
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing (pp. 368--377).
|
 |
37
|
|
 |
38
|
|
| |
39
|
Zha, H., He, X., Ding, C., & Simon, H. (2001). Spectral relaxation for k-means clustering. Proceedings of Neural Information Processing Systems.
|
| |
40
|
Zhao, Y., & Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document datasets (Technical Report). Department of Computer Science, University of Minnesota.
|
| |
41
|
|
CITED BY 16
|
|
|
|
|
Chris Ding , Tao Li , Wei Peng , Haesun Park, Orthogonal nonnegative matrix t-factorizations for clustering, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
Bo Long , Zhongfei (Mark) Zhang , Xiaoyun Wú , Philip S. Yu, Spectral clustering for multi-type relational data, Proceedings of the 23rd international conference on Machine learning, p.585-592, June 25-29, 2006, Pittsburgh, Pennsylvania
|
|
|
Bo Long , Xiaoyun Wu , Zhongfei (Mark) Zhang , Philip S. Yu, Unsupervised learning on k-partite graphs, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
Bo Long , Zhongfei (Mark) Zhang , Xiaoyun Wu , Philip S. Yu, Relational clustering by symmetric convex coding, Proceedings of the 24th international conference on Machine learning, p.569-576, June 20-24, 2007, Corvalis, Oregon
|
|
|
|
|
|
|
|
|
|
|
|
Yun Mao , Hani Jamjoom , Shu Tao , Jonathan M. Smith, Networkmd: topology inference and failure diagnosis in the last mile, Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, October 24-26, 2007, San Diego, California, USA
|
|
|
|
|
|
Yang Xiang , Ruoming Jin , David Fuhry , Feodor F. Dragan, Succinct summarization of transactional databases: an overlapped hyperrectangle scheme, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
|
|
Chris Ding , Tao Li , Wei Peng, Nonnegative matrix factorization and probabilistic latent semantic indexing: equivalence, chi-square statistic, and a hybrid method, Proceedings of the 21st national conference on Artificial intelligence, p.342-347, July 16-20, 2006, Boston, Massachusetts
|
|