|
ABSTRACT
Attribute data and relationship data are two principal types of data, representing the intrinsic and extrinsic properties of entities. While attribute data have been the main source of data for cluster analysis, relationship data such as social networks or metabolic networks are becoming increasingly available. It is also common to observe both data types carry complementary information such as in market segmentation and community identification, which calls for a joint cluster analysis of both data types so as to achieve better results. In this article, we introduce the novel Connected k-Center (CkC) problem, a clustering model taking into account attribute data as well as relationship data. We analyze the complexity of the problem and prove its NP-hardness. Therefore, we analyze the approximability of the problem and also present a constant factor approximation algorithm. For the special case of the CkC problem where the relationship data form a tree structure, we propose a dynamic programming method giving an optimal solution in polynomial time. We further present NetScan, a heuristic algorithm that is efficient and effective for large real databases. Our extensive experimental evaluation on real datasets demonstrates the meaningfulness and accuracy of the NetScan results.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Agarwal, P. and Procopiuc, C. M. 2002. Exact and approximation algorithms for clustering. Algorithmica 33, 2, 201--226.
|
 |
2
|
Mihael Ankerst , Markus M. Breunig , Hans-Peter Kriegel , Jörg Sander, OPTICS: ordering points to identify the clustering structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.49-60, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
3
|
Barabasi, A. L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., and Vicseks, T. 2002. Evolution of the social network of scientific collaborations. Physica A 311, 3--4, 590--614.
|
 |
4
|
|
 |
5
|
|
| |
6
|
Berriz, G. F., King, O. D., Bryant, B., Sander, C., and Roth, F. P. 2003. Characterizing gene sets with funcassociate. Bioinformatics 19, 18, 2502--2504.
|
| |
7
|
Brandes, U., Gaertler, M., and Wagner, D. 2003. Experiments on graph clustering algorithms. In Proceedings of the 11th Annual European Symposium on Algorithms (Budapest, Hungary). Springer-Verlag, Berlin/Heidelberg, Germany, 568--579.
|
| |
8
|
Brucker, P. 1977. On the complexity of clustering problems. In Optimization and Operations Research, R. Hehn, B. Korte, and W. Oettli, Eds. Springer-Verlag, Berlin, Germany, 45--54.
|
| |
9
|
Chan, P. K., Schlag, M. D. F., and Zien, J. Y. 1994. Spectral k-way ratio-cut partitioning and clustering. IEEE Trans. Computer-Aided Desi. Integ. Circ. Syst. 13, 9, 1088--1096.
|
| |
10
|
|
| |
11
|
|
| |
12
|
CiteSeer. 2006. Scientific literature digital library. http://citeseer.ist.psu.edu/.
|
| |
13
|
Davidson, I. and Ravi, S. S. 2005. Clustering with constraints: Feasibility issues and the k-means algorithm. In Proceedings of the 5th SIAM International Conference on Data Mining (Newport Beach, CA). Society for Industrial and Applied Mathematics, Philadelphia, PA, 138--149.
|
| |
14
|
DBLP. Computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/index.html.
|
| |
15
|
|
| |
16
|
|
| |
17
|
Dyer, M. and Frieze, A. M. 1985. A simple heuristic for the p-center problem. Oper. Res. Lett. 3, 285--288.
|
| |
18
|
Ester, M., Ge, R., Gao, B. J., Hu, Z., and Ben-Moshe, B. 2006. Joint cluster analysis of attribute data and relationship data: the connected k-center problem. In Proceedings of the 6th SIAM Conference on Data Mining (Bethesda, MD). Society for Industrial and Applied Mathematics, Philadelphia, PA, 246--257.
|
| |
19
|
Ester, M., Kriegel, H., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (Portland, OR). AAAI Press, 226--231.
|
 |
20
|
|
| |
21
|
Frederickson, G. N. and Johnson, D. B. 1979. Optimal algorithms for generating quantile information in x + y and matrices with sorted columns. In Proceedings of the 13th Annual Conference on Information Science and Systems. The Johns Hopkins Univ., Baltimore, MD, 47--52.
|
| |
22
|
Gonzalez, T. 1985. Clustering to minimize the maximum inter-cluster distance. Theoret. Comput. Sci. 38, 2--3, 293--306.
|
| |
23
|
|
| |
24
|
|
| |
25
|
Hanisch, D., Zien, A., Zimmer, R., and Lengauer, T. 2002. Co-clustering of biological networks and gene expression data. Bioinformatics 18, S145--S154.
|
| |
26
|
Hanneman, R. A. and Riddle, M. 2005. Introduction to social network methods. http://faculty.ucr.edu/~hanneman/.
|
| |
27
|
|
| |
28
|
Hochbaum, D. and Shmoys, D. 1985. A best possible heuristic for the k-center problem. Math. Oper. Res. 10, 180--184.
|
| |
29
|
Iacobucci, D. 1996. Networks in Marketing. Sage Publications, Thousand Oaks, CA.
|
| |
30
|
|
 |
31
|
|
| |
32
|
Kariv, O. and Hakimi, S. L. 1979. An algorithmic approach to network location problems, Part II: p-medians. SIAM J. Appl. Math. 37, 539--560.
|
| |
33
|
|
| |
34
|
Kaufman, L. and Rousseeuw, P. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
|
 |
35
|
Brian Kulis , Sugato Basu , Inderjit Dhillon , Raymond Mooney, Semi-supervised graph clustering: a kernel approach, Proceedings of the 22nd international conference on Machine learning, p.457-464, August 07-11, 2005, Bonn, Germany
[doi> 10.1145/1102351.1102409]
|
| |
36
|
|
| |
37
|
Lloyd, S. 1982. Least squares quantization in pcm. IEEE Trans. Inf. Theory 28, 2, 129--136.
|
| |
38
|
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematics, Statistics and Probability. 281--297.
|
| |
39
|
Megiddo, N. and Supowit, K. J. 1984. On the complexity of some common geometric location problems. SIAM Journal on Computing 13, 1, 182--196.
|
| |
40
|
Megiddo, N., Tamir, A., Zemel, E., and Chandrasekaran, R. 1981. An o(n log2 n) algorithm for the k-th longest path in a tree with applications to location problems. SIAM J. Comput. 10, 2, 328--337.
|
 |
41
|
|
| |
42
|
|
| |
43
|
Scott, J. 2000. Social Network Analysis: A handbook. Sage Publications, Thousand Oaks, CA.
|
| |
44
|
Segal, E., Wang, H., and Koller, D. 2003. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics (Suppl. 1) 19, 264--272.
|
| |
45
|
|
| |
46
|
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molec. Biol. Cell 9, 12, 3273--3297.
|
| |
47
|
Stark, C., Breitkreutz, B., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. 2006. Biogrid: A general repository for interaction datasets. Nucleic Acids Res. 34, D535--D539.
|
| |
48
|
Steinbach, M., Karypis, G., and Kumar, V. 2000. A comparison of document clustering techniques. In KDD Workshop on Text Mining.
|
| |
49
|
Steinhaus, H. 1956. Sur la division des corp materiels en parties. Bulletin L'Acadmie Polonaise des Science C1. III, IV, 801--804.
|
| |
50
|
|
| |
51
|
Tamir, A. 1996. An o(pn2) algorithm for the p-median and related problems on tree graphs. Operations Research Letters 19, 59--64.
|
| |
52
|
Taskar, B., Segal, E., and Koller, D. 2001. Probabilistic classification and clustering in relational data. In Proceedings of 17th International Joint Conference on Artificial Intelligence (Seattle, WA). Morgan Kaufmann, San Francisco, CA, 870--878.
|
| |
53
|
Toregas, C., Swan, R., Revelle, C., and Bergman, L. 1971. The location of emergency service facilities. Oper. Res. 19, 1363--1373.
|
| |
54
|
|
| |
55
|
Ulitsky, I. and Shamir, R. 2007. Identification of functional modules using network topology and high-throughput data. BMC System Biology 1, 8.
|
| |
56
|
Wasserman, S. and Faust, K. 1994. Social Network Analysis. Cambridge University Press, Cambridge, UK.
|
| |
57
|
Webster, C. and Morrison, P. 2004. Network analysis in marketing. Australasian Market. J. 12, 2, 8--18.
|
|