Compositional mining of multirelational biological datasets
Source
ACM Transactions on Knowledge Discovery from Data (TKDD)
Volume 2, Issue 1 (March 2008)
Article No. 2  
Year of Publication: 2008
ISSN:1556-4681
Authors
Ying Jin  Virginia Tech
T. M. Murali  Virginia Tech
Naren Ramakrishnan  Virginia Tech
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 324,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1342320.1342322

ABSTRACT

High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this article, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.
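The cascading idea in the abstract can be illustrated with a toy sketch. It assumes a bicluster is an all-ones submatrix of a binary gene-by-condition relation and that a redescription is a pair of descriptors whose gene sets have high Jaccard overlap. The function names, the brute-force search, the threshold, and the tiny datasets are all hypothetical stand-ins chosen for illustration; they are not the algorithms or data used in the article.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def biclusters(relation, min_rows=2, min_cols=2):
    """Brute-force enumeration of maximal all-ones biclusters.

    `relation` maps each row (e.g. a gene) to the set of columns
    (e.g. conditions) it is related to. Exponential in the number of
    rows, so only suitable for toy inputs.
    """
    rows = sorted(relation)
    found = set()
    for k in range(min_rows, len(rows) + 1):
        for combo in combinations(rows, k):
            cols = set.intersection(*(relation[r] for r in combo))
            if len(cols) >= min_cols:
                # Extend the row set to every row that contains all shared columns.
                closed = frozenset(r for r in rows if cols <= relation[r])
                found.add((closed, frozenset(cols)))
    return found


def redescriptions(desc_a, desc_b, min_jaccard=0.75):
    """Pairs of descriptors (one per vocabulary) covering nearly the same objects."""
    return [
        (a, b, jaccard(set_a, set_b))
        for a, set_a in desc_a.items()
        for b, set_b in desc_b.items()
        if jaccard(set_a, set_b) >= min_jaccard
    ]


def compose(expression, annotations, min_jaccard=0.75):
    """Cascade the two primitives: conditions -> genes -> annotation terms."""
    chains = []
    for genes, conds in biclusters(expression):
        # The bicluster turns a set of conditions into a descriptor over genes.
        derived = {tuple(sorted(conds)): genes}
        chains.extend(redescriptions(derived, annotations, min_jaccard))
    return chains


# Hypothetical toy data: genes -> stress conditions, and terms -> genes.
expression = {
    "g1": {"heat", "oxidative"},
    "g2": {"heat", "oxidative"},
    "g3": {"heat", "oxidative", "er"},
    "g4": {"er"},
}
annotations = {
    "term_stress_response": {"g1", "g2", "g3"},
    "term_protein_folding": {"g3", "g4"},
}

print(compose(expression, annotations))
```

In this sketch the bicluster shifts vocabulary from conditions to genes, and the redescription then shifts from genes to annotation terms, so the output chain links a set of conditions to a term without the two ever sharing a table.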

