| Efficient approximate entity extraction with edit distance constraints |
| Full text |
Pdf
(686 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 35th SIGMOD international conference on Management of data
table of contents
Providence, Rhode Island, USA
SESSION: Research session 20: data management pearls
table of contents
Pages 759-770
Year of Publication: 2009
ISBN:978-1-60558-551-2
|
|
Authors
|
|
Wei Wang
|
University of New South Wales and NICTA, Sydney, Australia
|
|
Chuan Xiao
|
University of New South Wales and NICTA, Sydney, Australia
|
|
Xuemin Lin
|
University of New South Wales and NICTA, Sydney, Australia
|
|
Chengqi Zhang
|
University of Technology, Sydney, Sydney, Australia
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 39, Downloads (12 Months): 160, Citation Count: 0
|
|
|
ABSTRACT
Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies using token-based similarity constraints, our problem definition enables us to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. Our problem is technically challenging as existing approaches based on q-gram filtering have poor performance due to the existence of many short entities in the dictionary. Our proposed solution is based on an improved neighborhood generation method employing novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
 |
3
|
|
| |
4
|
|
 |
5
|
|
 |
6
|
|
| |
7
|
|
 |
8
|
|
 |
9
|
Amit Chandel , Oktie Hassanzadeh , Nick Koudas , Mohammad Sadoghi , Divesh Srivastava, Benchmarking declarative approximate selection predicates, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
[doi> 10.1145/1247480.1247521]
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
A. M. Cohen. Unsupervised gene/protein entity normalization using automatically extracted dictionaries. In Proceedings of the BioLINK2005 Workshop 2005.
|
 |
15
|
|
| |
16
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
|
 |
24
|
|
| |
25
|
E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica 12(4/5):345--374, 1994.
|
 |
26
|
|
| |
27
|
|
| |
28
|
G. Navarro, R. A. Baeza-Yates, E. Sutinen, andJ. Tarhio. Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4):19--27, 2001.
|
 |
29
|
|
| |
30
|
B. -W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In ICDE pages 496--505, 2007.
|
 |
31
|
|
 |
32
|
|
| |
33
|
B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007. 02, Department of Informatics, University of Zurich, April 2007.
|
| |
34
|
L. Tanabe and W. J. Wilbur. Generation of a large gene/protein lexicon by morphological pattern analysis. Journal of Bioinformatics and Computational Biology 1(4):1--16, 2004.
|
| |
35
|
|
| |
36
|
|
| |
37
|
E. Ukkonen. Finding approximate patterns in strings. J. Algorithms 6(1):132--137, 1985.
|
| |
38
|
J. Wang, Z. Li, C. Cai, and Y. Chen. Assessment of approximate string matching in a biomedical text retrieval problem. Computers in Biology and Medicine 35(8):717--724, 2005.
|
| |
39
|
|
 |
40
|
|
 |
41
|
|
| |
42
|
|
|