ABSTRACT
In this paper, we describe the use of text encoding and prediction by partial matching (PPM) language modeling to identify gene functions within the abstracts of biomedical papers. The National Center for Biotechnology Information maintains GeneRIF, a collection of the best possible functional representations for a subset of PubMed abstracts, and we use it to evaluate our technique. We discuss the methodology used to construct the models that enable the Text Mining Toolkit to distinguish gene functions from the rest of the abstract (non-gene functions). We also describe the similarity-based approach that we apply to the list of automatically annotated functions to select the gene function most likely to represent the paper. The results indicate that the combined approach performs very well on both precision and recall when identifying gene functions in scientific abstracts, and it therefore presents opportunities for extracting other entities embedded in scientific text.
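The two stages described in the abstract can be illustrated with a minimal sketch. It is not the Text Mining Toolkit's interface and not a full PPM implementation: a fixed-order character model with add-one smoothing stands in for PPM, word-level Jaccard overlap stands in for the unspecified similarity measure, and the class names, function names, gene names, and example sentences are all invented for illustration.

```python
import math
from collections import Counter


class CharNGramModel:
    """Fixed-order character model with add-one smoothing.

    Assumption: this is a simplified stand-in for a PPM model, which would
    blend several context orders through an escape mechanism instead.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.context_counts = Counter()
        self.ngram_counts = Counter()

    def train(self, text):
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def cross_entropy(self, text):
        """Average number of bits per character needed to encode `text`."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts[context + padded[i]]
            total = self.context_counts[context]
            bits -= math.log2((seen + 1) / (total + self.alphabet_size))
        return bits / max(len(text), 1)


def classify(sentence, models):
    """Return the label whose model encodes the sentence in the fewest bits."""
    return min(models, key=lambda label: models[label].cross_entropy(sentence))


def most_representative(candidates):
    """Return the candidate with the highest total word overlap with the others."""

    def jaccard(a, b):
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        union = words_a | words_b
        return len(words_a & words_b) / len(union) if union else 0.0

    def score(index):
        return sum(
            jaccard(candidates[index], other)
            for j, other in enumerate(candidates)
            if j != index
        )

    return candidates[max(range(len(candidates)), key=score)]


# Invented toy training data; a real system would train on annotated abstracts.
training = {
    "function": [
        "GeneA regulates apoptosis in neurons",
        "GeneB promotes cell cycle arrest",
        "GeneA is required for DNA repair",
    ],
    "background": [
        "We collected samples from forty patients",
        "Further studies are needed",
        "The data were analysed with standard methods",
    ],
}

models = {}
for label, sentences in training.items():
    models[label] = CharNGramModel(order=2)
    for sentence in sentences:
        models[label].train(sentence)

print(classify("GeneA regulates apoptosis in neurons", models))
```

Choosing the class by minimum cross-entropy mirrors the compression-based view of classification: a sentence is assigned to the model that would compress it best. With real data, the candidate sentences labelled as gene functions would then be passed to a selection step such as `most_representative`.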
INDEX TERMS
Primary Classification: I. Computing Methodologies; I.2 ARTIFICIAL INTELLIGENCE; I.2.7 Natural Language Processing
Subjects: Language generation
Additional Classification: I. Computing Methodologies; I.2 ARTIFICIAL INTELLIGENCE; I.2.7 Natural Language Processing
Subjects: Language parsing and understanding; Machine translation; Language models; Text analysis
General Terms: Algorithms, Design, Documentation, Experimentation, Languages, Measurement, Performance, Theory, Verification
Keywords: entropy, gene function identification, prediction by partial matching (PPM), text mining