ACM Home Page
Please provide us with feedback. Feedback
Efficient approximate entity extraction with edit distance constraints
Full text PdfPdf (686 KB)
Source
International Conference on Management of Data archive
Proceedings of the 35th SIGMOD international conference on Management of data table of contents
Providence, Rhode Island, USA
SESSION: Research session 20: data management pearls table of contents
Pages 759-770  
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Wei Wang  University of New South Wales and NICTA, Sydney, Australia
Chuan Xiao  University of New South Wales and NICTA, Sydney, Australia
Xuemin Lin  University of New South Wales and NICTA, Sydney, Australia
Chengqi Zhang  University of Technology, Sydney, Sydney, Australia
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 39,   Downloads (12 Months): 160,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1559845.1559925
What is a DOI?

ABSTRACT

Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extraction performance.

In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies using token-based similarity constraints, our problem definition enables us to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. Our problem is technically challenging as existing approaches based on q-gram filtering have poor performance due to the existence of many short entities in the dictionary. Our proposed solution is based on an improved neighborhood generation method employing novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
 
4
5
6
 
7
8
9
 
10
 
11
 
12
13
 
14
A. M. Cohen. Unsupervised gene/protein entity normalization using automatically extracted dictionaries. In Proceedings of the BioLINK2005 Workshop 2005.
15
 
16
 
17
 
18
 
19
 
20
 
21
 
22
 
23
24
 
25
E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica 12(4/5):345--374, 1994.
26
 
27
 
28
G. Navarro, R. A. Baeza-Yates, E. Sutinen, andJ. Tarhio. Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4):19--27, 2001.
29
 
30
B. -W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In ICDE pages 496--505, 2007.
31
32
 
33
B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007. 02, Department of Informatics, University of Zurich, April 2007.
 
34
L. Tanabe and W. J. Wilbur. Generation of a large gene/protein lexicon by morphological pattern analysis. Journal of Bioinformatics and Computational Biology 1(4):1--16, 2004.
 
35
 
36
 
37
E. Ukkonen. Finding approximate patterns in strings. J. Algorithms 6(1):132--137, 1985.
 
38
J. Wang, Z. Li, C. Cai, and Y. Chen. Assessment of approximate string matching in a biomedical text retrieval problem. Computers in Biology and Medicine 35(8):717--724, 2005.
 
39
40
41
 
42

Collaborative Colleagues:
Wei Wang: colleagues
Chuan Xiao: colleagues
Xuemin Lin: colleagues
Chengqi Zhang: colleagues