ABSTRACT
Abbreviations and acronyms are widely used in the biomedical literature, and many of them represent important biomedical concepts. Because many abbreviations are ambiguous (e.g., CAT denotes both chloramphenicol acetyl transferase and computed axial tomography, depending on the context), recognizing the full form associated with each abbreviation is in most cases equivalent to identifying the meaning of the abbreviation. This, in turn, allows more accurate natural language processing, information extraction, and retrieval. In this study, we developed supervised approaches to identifying the full forms of ambiguous abbreviations in the context in which they appear. We first automatically generated a dictionary of all possible full forms for each abbreviation; we then treated the in-context full-form prediction for each abbreviation occurrence as a case of word-sense disambiguation and applied supervised machine-learning algorithms to it. Because some of the links between abbreviations and their corresponding full forms are given explicitly in the text and can be recovered automatically, we use these explicit links as automatically generated training data for disambiguating abbreviations that are not linked to a full form within a text. We evaluated our methods on over 150 thousand abstracts and obtained coverage and precision of 82% and 92%, respectively, under tenfold cross-validation, and 79% and 80%, respectively, when evaluated against an external set of abstracts in which the abbreviations are not defined.
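The idea of using explicitly defined abbreviations as automatically labeled training data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the initial-letter matching rule in `harvest`, the bag-of-words features, the add-one-smoothed naive Bayes classifier, the class name `AbbrevDisambiguator`, and the toy sentences in `TRAINING_TEXTS`.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercased alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def harvest(text):
    """Yield (abbr, full_form, context_words) for each 'full form (ABBR)' in text.

    Toy rule (an assumption): the full form is the run of words just before
    the parenthesis whose initials spell the abbreviation.
    """
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = m.group(1)
        cand = text[:m.start()].split()[-len(abbr):]
        if len(cand) == len(abbr) and all(
            w[0].lower() == c.lower() for w, c in zip(cand, abbr)
        ):
            full = " ".join(cand).lower()
            skip = set(tokens(full)) | {abbr.lower()}
            yield abbr, full, [w for w in tokens(text) if w not in skip]


class AbbrevDisambiguator:
    """Naive Bayes over context words, one sense inventory per abbreviation."""

    def __init__(self):
        self.sense_counts = defaultdict(Counter)  # abbr -> Counter(full form)
        self.word_counts = defaultdict(Counter)   # (abbr, full) -> Counter(word)
        self.vocab = set()

    def train(self, texts):
        # Explicit definitions found in the texts act as labeled examples.
        for text in texts:
            for abbr, full, context in harvest(text):
                self.sense_counts[abbr][full] += 1
                self.word_counts[(abbr, full)].update(context)
                self.vocab.update(context)

    def predict(self, abbr, text):
        """Return the most probable full form, or None for an unknown abbreviation."""
        senses = self.sense_counts.get(abbr)
        if not senses:
            return None
        words = [w for w in tokens(text) if w != abbr.lower()]
        total = sum(senses.values())
        best, best_score = None, -math.inf
        for full, n in senses.items():
            counts = self.word_counts[(abbr, full)]
            denom = sum(counts.values()) + len(self.vocab) + 1  # add-one smoothing
            score = math.log(n / total)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = full, score
        return best


# Hypothetical sentences in which the abbreviation is explicitly defined.
TRAINING_TEXTS = [
    "The chloramphenicol acetyl transferase (CAT) reporter gene was "
    "expressed in transfected cells.",
    "Patients underwent computed axial tomography (CAT) to image the "
    "brain lesion.",
]

if __name__ == "__main__":
    model = AbbrevDisambiguator()
    model.train(TRAINING_TEXTS)
    # An occurrence of CAT that is not defined in its own text.
    print(model.predict("CAT", "A CAT scan of the brain showed a lesion in both patients."))
```

In this sketch an abbreviation with no harvested definition yields `None`. That is one simple way to see why coverage and precision are reported separately.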