|
ABSTRACT
Joint mining of multiple datasets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in bioinformatics, jointly mining multiple gene expression datasets obtained by different labs or during various biological processes may overcome the heavy noise in the data. Moreover, by joint mining of gene expression data and protein-protein interaction data, we may discover clusters of genes which show coherent expression patterns and also produce interacting proteins. Such clusters may be potential pathways. In this article, we investigate a novel data mining problem, mining frequent cross-graph quasi-cliques, which is generalized from several interesting applications in bioinformatics, cross-market customer segmentation, social network analysis, and Web mining. In a graph, a set of vertices S is a γ-quasi-clique (0 < γ ≤ 1) if each vertex v in S directly connects to at least γ ⋅ (|S| − 1) other vertices in S. Given a set of graphs G1, …, Gn and parameter min_sup (0 < min_sup ≤ 1), a set of vertices S is a frequent cross-graph quasi-clique if S is a γ-quasi-clique in at least min_sup ⋅ n graphs, and there does not exist a proper superset of S having the property. We build a general model, show why the complete set of frequent cross-graph quasi-cliques cannot be found by previous data mining methods, and study the complexity of the problem. While the problem is difficult, we develop practical algorithms which exploit several interesting and effective techniques and heuristics to efficaciously mine frequent cross-graph quasi-cliques. A systematic performance study is reported on both synthetic and real data sets. We demonstrate some interesting and meaningful frequent cross-graph quasi-cliques in bioinformatics. The experimental results also show that our algorithms are efficient and scalable.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proc. Natl. Acad. Sci. 96, 12, 6745--6750.
|
| |
3
|
Bader, G. and Hogue, C. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1, 2.
|
| |
4
|
Barab´si, A. and Albert, R. 1999. Emergence of scaling in random networks. Science 286.
|
| |
5
|
Bayada, D., Simpson, R., and Johnson, A. 1992. An algorithm for the multiple common subgraph problem. J. Chemi. Inform. Comput. Sci., 680--685.
|
| |
6
|
Beissbarth, T., Fellenberg, K., Brors, B., et al. 2000. Processing and quality control of DNA array hybridization data. Bioinformatics 16, 1014--1022.
|
| |
7
|
Ben-Dor, A., Shamir, R., and Yakhini, Z. 1999. Clustering gene expression patterns. J. Computati. Biol. 6, 3/4, 281--297.
|
| |
8
|
Brazma, A. and Vilo, J. 2000. Minireview: Gene expression data analysis. Federation European Biochemical Societies 480, 17--24.
|
| |
9
|
Bu, D., Zhao, Y., Cai, L., et al. 2003. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucl. Acids Research. 31, 9, 2443--2450.
|
| |
10
|
Cho, R., Campbell, M., Winzeler, E., et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mole. Cell 2, 1, 65--73.
|
| |
11
|
Choi, K., Satterberg, B., Lyons, D., and Elion, E. 1994. Ste5 tethers multiple protein kinases in the map kinase cascade required for mating in S. cerevisiae. Cell 78, 499--C512.
|
| |
12
|
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P., and Herskowitz, I. 1998. The transcriptional program of sporulation in budding yeast. Science 282, 5389, 699--705.
|
| |
13
|
DeRisi, J., Iyer, V., and Brown, P. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 680--686.
|
| |
14
|
|
| |
15
|
Dongen, S. 2000. Graph clustering by flow simulation. PhD Thesis, University of Utrecht, The Netherlands.
|
 |
16
|
|
| |
17
|
Eisen, M., Spellman, P., Brown, P., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95, 14863--14868.
|
| |
18
|
Enright, A., Dongen, S., and Ouzounis, C. 2002. An efficient algorithm for large-scale detection of protein families. Nucl. Acids Resear. 30, 7, 1575--1584.
|
 |
19
|
|
| |
20
|
|
| |
21
|
Gasch, A., Huang, M., Metzner, S., Botstein, D., Elledge, S., and Brown, P. 2001. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell 12, 10, 2987--3003.
|
| |
22
|
Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M., Storz, G., Botstein, D., and Brown, P. 2000. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11, 12, 4241--4257.
|
| |
23
|
Gavin, A., Bosche, M., Krause, R., and et. al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 6868, 123--124.
|
| |
24
|
|
| |
25
|
Ho, Y., Gruhler, A., Heilbut, A., et al. 2002. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature 415, 180--183.
|
| |
26
|
Holder, L., Cook, D., and Djoko, S. 1994. Substructure discovery in the subdue system. In Proceedings of the AAAI'94 Workshop on Knowledge Discvoery in Databases (KDD'94). 359--370.
|
| |
27
|
|
| |
28
|
Ito, T., Tashiro, K., Muta, S., Ozawa, R., Chiba, T., Nishizawa, M., Yamamoto, K., Kuhara, S., and Sakaki, Y. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine twohybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci.
|
 |
29
|
|
| |
30
|
|
| |
31
|
Lee, H., Hsu, A., Sajdak, J., Qin, J., and Pavlidis, P. 2004. Coexpression analysis of human genes across many microarray data sets. Genome Resear. 14, 1085--1094.
|
| |
32
|
Lyons, D., Mahanty, S., Choi, K., Manandhar, M., and Elion, E. 1996. The Sh3-domain protein Bem1 coordinates mitogen-activated protein kinase cascade activation with cell cycle control in Saccharomyces cerevisiae. Mol. Cell Biol. 16, 4095--C4106.
|
| |
33
|
|
| |
34
|
Mewes, H., Frishman, D., Mayer, K., et al. 2006. MIPS: Analysis and annotation of proteins from whole genomes in 2005. Nucl. Acids Resear. 34 (Supplement 1), D169--D172.
|
| |
35
|
Moreau, Y., Aerts, S., Moor, B., Strooper, B., and Dabrowski, M. 2003. Comparison and meta-analysis of microarray data: From the bench to the computer desk. TRENDS Genetics 19, 570--577.
|
| |
36
|
Ng, A., Jordan, M., and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. Adv. Neural Inform. Process. Syst. 14.
|
 |
37
|
|
 |
38
|
|
 |
39
|
|
| |
40
|
Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of International Conference on Principles of Knowledge Representation Reasoning (RR'92).
|
| |
41
|
Segal, E., Wang, H., and Koller, D. 2003. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19, i264--i272.
|
| |
42
|
|
| |
43
|
Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., and Futcher, B. 1998. Exploring the metabolic and genetic control of gene expression on a genomic scale. Mol. Biol. Cell 3273.
|
| |
44
|
Stuart, J., Segal, E., Koller, D., and Kim, S. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249--255.
|
| |
45
|
Takahashi, Y., Satoh, Y., and Sasaki, S. 1987. Recognition of largest common fragment among a variety of chemical structures. Analyt. Sci., 23--28.
|
| |
46
|
Tamayo, P., Solni, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., and Golub, T. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. 96, 6, 2907--2912.
|
| |
47
|
Tavazoie, S., Hughes, D., Campbell, M., Cho, R., and Church, G. 1999. Systematic determination of genetic network architecture. Nature Genet. 281--285.
|
| |
48
|
Tong, A. H., Evangelista, M., Parsons, A. B., and et al. 2001. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364--2368.
|
| |
49
|
Tong, A. H. Y., Lesage, G., Bader, G. D., and et al. 2004. Global mapping of the yeast genetic interaction network. Science 303, 808--813.
|
| |
50
|
Tseng, G., Oh, M., Rohlin, L., Liao, J., and Wong, W. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Resear. 29, 2549--2557.
|
| |
51
|
Uetz, P., Giot, L., Cagney, G., and et. al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces Cerevisiae. Nature 403, 6770, 601--603.
|
 |
52
|
Chen Wang , Wei Wang , Jian Pei , Yongtai Zhu , Baile Shi, Scalable mining of large disk-based graph databases, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
[doi> 10.1145/1014052.1014088]
|
| |
53
|
Xu, Y., Olman, V., and Xu, D. 2002. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18, 536--545.
|
| |
54
|
|
 |
55
|
|
 |
56
|
|
|