|
ABSTRACT
Joint mining of multiple data sets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in cross-market customer segmentation, a group of customers who behave similarly in multiple markets should be considered as a more coherent and more reliable cluster than clusters found in a single market. As another example, in bioinformatics, by joint mining of gene expression data and protein interaction data, we can find clusters of genes which show coherent expression patterns and also produce interacting proteins. Such clusters may be potential pathways.In this paper, we investigate a novel data mining problem, mining cross-graph quasi-cliques, which is generalized from several interesting applications such as cross-market customer segmentation and joint mining of gene expression data and protein interaction data. We build a general model for mining cross-graph quasi-cliques, show why the complete set of cross-graph quasi-cliques cannot be found by previous data mining methods, and study the complexity of the problem. While the problem is difficult, we develop an efficient algorithm, Crochet, which exploits several interesting and effective techniques and heuristics to efficaciously mine cross-graph quasi-cliques. A systematic performance study is reported on both synthetic and real data sets. We demonstrate some interesting and meaningful cross-graph quasi-cliques in bioinformatics. The experimental results also show that algorithm Crochet is efficient and scalable.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Bader, G.D. and Hogue, C.W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4(1)(2), Jan. 2003.
|
| |
3
|
Barabási, A.L. and Albert, R. Emergence of scaling in random networks. Science, 286, Oct. 1999.
|
| |
4
|
Bayada, D.M. et al. An algorithm for the nultiple common subgraph problem. J. of Chem Info & Comp Sci, pages 680--685, 1992.
|
| |
5
|
Ben-Dor, A. et al. Clustering gene expression patterns. J. of Comp Biology, 6(3/4):281--297, 1999.
|
| |
6
|
Bu, D. et al. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research, 31(9):2443--2450, 2003.
|
| |
7
|
|
| |
8
|
Cho, R.J. et al. A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell, Vol. 2(1):65--73, July 1998.
|
| |
9
|
|
| |
10
|
Dongen, S.V. Graph clustering by flow simulation. PhD Thesis, University of Utrecht, The Netherlands, 2000.
|
 |
11
|
|
| |
12
|
Enright, A.J., Dongen, S.V. and Ouzounis, C.A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Reserach, 30(7):1575--1584, 2002.
|
 |
13
|
|
| |
14
|
|
| |
15
|
|
| |
16
|
Holder, L.B. et al. Substructure discovery in the subdue system. In KDD'94.
|
| |
17
|
|
 |
18
|
|
| |
19
|
Kasturi, J. et al. An information theoretic approach for analyzing temporal patterns of gene expression. Bioinformatics, pages 449--458, 2003.
|
| |
20
|
|
| |
21
|
|
| |
22
|
Ng, A.Y. et al. On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems, 14, 2001.
|
 |
23
|
|
| |
24
|
Rymon, R. Search through systematic set enumeration. In KR'92.
|
| |
25
|
Segal, E. et al. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19:i264--i272, 2003.
|
| |
26
|
|
| |
27
|
Takahashi, Y. et al. Recognition of largest common fragment among a variety of chemical structures. Analytical Sciences, pages 23--28, 1987.
|
 |
28
|
Chen Wang , Wei Wang , Jian Pei , Yongtai Zhu , Baile Shi, Scalable mining of large disk-based graph databases, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
[doi> 10.1145/1014052.1014088]
|
 |
29
|
|
| |
30
|
Xu, Y. et al. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18:536--545, 2002.
|
 |
31
|
|
| |
32
|
|
 |
33
|
|
CITED BY 13
|
|
Byung-Won On , Ergin Elmacioglu , Dongwon Lee , Jaewoo Kang , Jian Pei, An effective approach to entity resolution problem using quasi-clique and its application to digital libraries, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
|
|
|
Zhiping Zeng , Jianyong Wang , Lizhu Zhou , George Karypis, Coherent closed quasi-clique discovery from large dense graph databases, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
Shashank Pandit , Duen Horng Chau , Samuel Wang , Christos Faloutsos, Netprobe: a fast and scalable system for fraud detection in online auction networks, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
|
|
|
Nan Du , Bin Wu , Xin Pei , Bai Wang , Liutong Xu, Community detection in large-scale social networks, Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, p.16-25, August 12-12, 2007, San Jose, California
|
|
|
Hanghang Tong , Christos Faloutsos , Brian Gallagher , Tina Eliassi-Rad, Fast best-effort pattern matching in large attributed graphs, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
Tengjiao Wang , Bishan Yang , Jun Gao , Dongqing Yang , Shiwei Tang , Haoyu Wu , Kedong Liu , Jian Pei, MobileMiner: a real world case study of data mining in mobile communication, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|
|
|
|