ABSTRACT
We consider the problem of improving named entity recognition (NER) systems by using external dictionaries. More specifically, we consider the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance NER systems operate by sequentially classifying words as to whether or not they participate in an entity name, whereas the most useful similarity measures score entire candidate names. To correct this mismatch, we formalize a semi-Markov extraction process, which is based on sequentially classifying segments of several adjacent words rather than single words. In addition to providing a natural way of coupling high-performance NER methods with high-performance similarity functions, this formalism allows the direct use of other useful entity-level features and gives a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance over previous methods for using external dictionaries in NER.
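The segment-level idea can be illustrated with a minimal sketch. It is not the model described in the paper, which is learned from data. Here the scores are set by hand, the similarity measure is a simple token-level Jaccard overlap, and the names `jaccard`, `segment_score`, `semi_markov_decode`, the `threshold` of 0.5, and the maximum segment length of 3 are all assumptions made for illustration. The sketch shows only how a Viterbi-style search over segments lets a whole candidate name be compared with dictionary entries, which a word-by-word classifier cannot do directly.

```python
from typing import List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Token-set overlap; a stand-in for a real string-similarity measure."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def segment_score(tokens: Sequence[str], label: str,
                  dictionary: Sequence[str], threshold: float = 0.5) -> float:
    """Hand-set score (an assumption): non-entity segments score 0, and an
    entity segment scores its best dictionary similarity minus a threshold."""
    if label == "O":
        return 0.0
    best = max((jaccard(tokens, entry.lower().split()) for entry in dictionary),
               default=0.0)
    return best - threshold


def semi_markov_decode(tokens: Sequence[str], dictionary: Sequence[str],
                       max_len: int = 3) -> List[Segment]:
    """Find the highest-scoring segmentation by dynamic programming.

    best[end] is the best total score of any segmentation of tokens[:end].
    Non-entity ("O") segments are restricted to length 1 in this sketch.
    """
    n = len(tokens)
    lowered = [t.lower() for t in tokens]
    best = [float("-inf")] * (n + 1)
    back: List[Optional[Segment]] = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            labels = ("O", "ENTITY") if length == 1 else ("ENTITY",)
            for label in labels:
                score = best[start] + segment_score(lowered[start:end], label,
                                                    dictionary)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, end, label)
    # Follow back-pointers from the end of the sentence to recover segments.
    segments: List[Segment] = []
    pos = n
    while pos > 0:
        seg = back[pos]
        assert seg is not None
        segments.append(seg)
        pos = seg[0]
    return segments[::-1]


sentence = "I met William Cohen yesterday".split()
names = ["William W. Cohen", "Sunita Sarawagi"]
print(semi_markov_decode(sentence, names))
```

In this toy run, neither "William" nor "Cohen" alone overlaps enough with the dictionary entry "William W. Cohen" to clear the threshold, but the two-word segment does, so it is labeled as a single entity.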