Identification of gene function using prediction by partial matching (PPM) language models
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
Session: KM: data mining
Pages 779-786
Year of Publication: 2008
ISBN: 978-1-59593-991-3
Authors
Malika Mahoui  IUPUI, Indianapolis, IN, USA
William John Teahan  University of Wales, Bangor, Wales, United Kingdom
Arvind Kumar Thirumalaiswamy Sekhar  Dow AgroSciences, Indianapolis, IN, USA
Satyasaibabu Chilukuri  IUPUI, Indianapolis, IN, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458186

ABSTRACT

In this paper, we describe the use of text encoding and prediction by partial matching (PPM) language modeling to identify gene functions within abstracts of biomedical papers. The National Center for Biotechnology Information maintains "GeneRIF", a collection of the best possible functional representations for a subset of abstracts from PubMed. We use GeneRIF to evaluate our technique. We discuss the methodology adopted to construct the models that enable the Text Mining Toolkit to distinguish between gene functions and the rest of the abstract (non-gene-function text). We also describe the similarity-based approach we apply to the list of automatically annotated functions to generate the gene function most likely to be representative of the paper. The results indicate that our combined approach to identifying gene functions in scientific abstracts performs very well on both precision and recall, and therefore presents exciting opportunities for extracting other entities embedded in scientific text.
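
The idea of separating gene-function text from the rest of an abstract with compression-based language models can be illustrated with a minimal sketch. It is not the Text Mining Toolkit and not the authors' implementation: the class `CharContextModel`, the function `classify`, the order of 3, and the toy training sentences are all assumptions made for illustration, and the model is a simplified PPM-style character model (escape estimation in the style of PPM method C, without exclusions). Each sentence is assigned to whichever model encodes it in fewer bits per character.

```python
import math
from collections import Counter, defaultdict


class CharContextModel:
    """Simplified PPM-style character model (hypothetical, for illustration).

    Counts are kept for every context of length 0..order. Prediction backs off
    from the longest context to shorter ones, multiplying in an escape
    probability at each step (method C estimate, no exclusions).
    """

    def __init__(self, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(Counter)  # context string -> next-char counts

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][ch] += 1

    def prob(self, context, ch):
        p_escape = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            seen = self.counts.get(ctx)
            if not seen:
                continue  # context never observed: move to a shorter one
            total = sum(seen.values())
            distinct = len(seen)
            if ch in seen:
                return p_escape * seen[ch] / (total + distinct)
            p_escape *= distinct / (total + distinct)
        # Character unseen in every context: uniform fallback over the alphabet.
        return p_escape / self.alphabet_size

    def cross_entropy(self, text):
        """Average number of bits per character needed to encode `text`."""
        bits = 0.0
        for i, ch in enumerate(text):
            context = text[max(0, i - self.order):i]
            bits -= math.log2(self.prob(context, ch))
        return bits / max(1, len(text))


def classify(sentence, function_model, other_model):
    """Label a sentence by the model that compresses it better."""
    if function_model.cross_entropy(sentence) < other_model.cross_entropy(sentence):
        return "function"
    return "other"


# Toy training text, invented for this sketch (not taken from GeneRIF or PubMed).
function_model = CharContextModel()
function_model.train(
    "BRCA1 regulates DNA repair. TP53 is required for apoptosis. "
    "MYC regulates cell proliferation. "
)
other_model = CharContextModel()
other_model.train(
    "We collected samples from patients. The study was conducted in 2005. "
    "Results are shown in the table. "
)

label = classify("TP53 is required for apoptosis.", function_model, other_model)
```

In practice the two models would be trained on much larger collections, for example GeneRIF sentences on one side and the remaining abstract sentences on the other; the toy strings here only make the sketch runnable.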


