Rule based synonyms for entity extraction from noisy text
Source: AND; Vol. 303
Proceedings of the second workshop on Analytics for noisy unstructured text data
Singapore
Pages 31-38  
Year of Publication: 2008
ISBN: 978-1-60558-196-5
Authors
Rema Ananthanarayanan  IBM India Research Lab, New Delhi, India
Vijil Chenthamarakshan  IBM India Research Lab, Bangalore, India
Prasad M Deshpande  IBM India Research Lab, Bangalore, India
Raghuram Krishnapuram  IBM India Research Lab, Bangalore, India
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1390749.1390756

ABSTRACT

Identification of named entities such as person, organization and product names from text is an important task in information extraction. In many domains, the same entity may be referred to in multiple ways because of variations introduced by different user groups, regional or cultural spelling differences, abbreviations, typographical errors and other reasons associated with conventional usage. Identifying a piece of text as a mention of an entity in such noisy data is difficult, even with a dictionary of possible entities. Previous approaches treat the synonym problem as part of entity disambiguation and apply learning-based methods that rely on the context of the words to identify synonyms. In this paper, we show that existing domain knowledge, encoded as rules, can be used effectively to address the synonym problem to a considerable extent. This makes the disambiguation task simpler, without the need for much training data. We look at a subset of application scenarios in named entity extraction, categorize the possible variations in entity names, and define rules for each category. Using these rules, we generate synonyms for the canonical list and match these synonyms to the actual occurrences in the data sets. In particular, we describe the rule categories that we developed for several named entities and report the results of applying our synonym-generation technique to extract named entities in two different domains.
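
The pipeline outlined in the abstract (rules per variation category, synonym generation from a canonical list, dictionary matching against the text) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and not taken from the paper: the three rule functions, the `CORPORATE_SUFFIXES` list, and the names `generate_synonyms`, `build_index` and `extract_entities` are hypothetical, and the paper's actual rule categories are not given in the abstract.

```python
import re

# Hypothetical suffix list; the paper's real rule sets are not in the abstract.
CORPORATE_SUFFIXES = {"inc", "corp", "corporation", "ltd", "limited", "co"}


def drop_suffix(name):
    """Rule: drop a trailing corporate suffix ("X Corporation" -> "X")."""
    tokens = name.split()
    if len(tokens) > 1 and tokens[-1].lower().strip(".") in CORPORATE_SUFFIXES:
        return [" ".join(tokens[:-1])]
    return []


def acronym(name):
    """Rule: abbreviate a multi-word name to its initials."""
    tokens = name.split()
    if len(tokens) > 1:
        return ["".join(t[0].upper() for t in tokens)]
    return []


def hyphen_variants(name):
    """Rule: spelling variants of hyphenated names."""
    if "-" in name:
        return [name.replace("-", " "), name.replace("-", "")]
    return []


RULES = [drop_suffix, acronym, hyphen_variants]


def generate_synonyms(canonical):
    """Apply every rule repeatedly until no new variant appears."""
    variants = {canonical}
    frontier = [canonical]
    while frontier:
        current = frontier.pop()
        for rule in RULES:
            for variant in rule(current):
                if variant not in variants:
                    variants.add(variant)
                    frontier.append(variant)
    return variants


def build_index(canonical_names):
    """Map each lowercased synonym to the canonical name that produced it."""
    index = {}
    for name in canonical_names:
        for synonym in generate_synonyms(name):
            index.setdefault(synonym.lower(), name)
    return index


def extract_entities(text, index):
    """Return (mention, canonical name) pairs found in text, in order."""
    if not index:
        return []
    # Longest synonyms first so the alternation prefers the longest match.
    alternatives = sorted(index, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(s) for s in alternatives) + r")\b"
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.group(0), index[m.group(0).lower()]) for m in regex.finditer(text)]


if __name__ == "__main__":
    idx = build_index(
        ["International Business Machines Corporation", "Hewlett-Packard"]
    )
    print(extract_entities("ibm and Hewlett Packard announced; HP shipped.", idx))
```

Chaining the rules is what lets "International Business Machines Corporation" reach "IBM" (suffix dropped, then abbreviated). The same chaining also over-generates variants such as "IBMC", so a practical rule set would presumably need per-category constraints or a filtering step.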

