ABSTRACT
This paper describes an application of information retrieval (IR) and text categorization methods to a practical problem in biomedicine: Gene Ontology (GO) annotation. GO annotation, a major activity in most model organism database projects, describes gene functions using a controlled vocabulary. As a first step toward automatic GO annotation, we aim to assign GO domain codes given a specific gene and an article in which the gene appears, one of the tasks at the TREC 2004 Genomics Track. We approached the task with careful consideration of the specialized terminology, paying special attention to the various forms of gene synonyms so as to locate occurrences of the target gene exhaustively. We extracted the words around the gene occurrences and used them to represent the gene for GO domain code annotation. As a classifier, we adopted a variant of k-Nearest Neighbor (kNN) with supervised term weighting schemes to improve performance, which placed our method among the top-performing systems in the TREC official evaluation. We also demonstrate that the proposed framework applies successfully to another task of the Genomics Track, with results comparable to those of the best-performing system.
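The pipeline outlined in the abstract (locate gene occurrences through synonyms, collect the surrounding words, classify with kNN) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the window size, the value of k, the toy training data, and the use of cosine similarity over raw term counts are all assumptions made for illustration. The supervised term weighting mentioned in the abstract is omitted, and only single-token synonyms are matched. The labels BP, CC, and MF stand for the three GO domains (biological process, cellular component, molecular function).

```python
import math
import re
from collections import Counter


def context_words(text, synonyms, window=3):
    """Count words within `window` tokens of any synonym occurrence.

    Hypothetical helper: matches single-token synonyms only, case-insensitively.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    targets = {s.lower() for s in synonyms}
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok in targets:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Keep the neighbouring words, not the gene names themselves.
            bag.update(t for t in tokens[lo:hi] if t not in targets)
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_domain_scores(query, training, k=2):
    """Score each label by the summed similarity of the k nearest examples.

    `training` is a list of (count_vector, labels) pairs (assumed format).
    """
    ranked = sorted(
        ((cosine(query, vec), labels) for vec, labels in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = Counter()
    for sim, labels in ranked[:k]:
        for label in labels:
            scores[label] += sim
    return scores


# Toy usage with made-up data.
article = "The TP53 protein binds DNA in the nucleus. Mutant p53 fails to bind."
gene_bag = context_words(article, ["TP53", "p53"], window=2)
toy_training = [
    (Counter({"binds": 1, "protein": 1}), ["MF"]),
    (Counter({"nucleus": 1}), ["CC"]),
    (Counter({"kinase": 1}), ["BP"]),
]
domain_scores = knn_domain_scores(gene_bag, toy_training, k=2)
```

A threshold on the per-domain scores would then decide which domain codes to assign. How that threshold is chosen, and how terms are weighted before the similarity is computed, are the parts this sketch leaves out.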