ABSTRACT
High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this article, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.