| An efficient filter for approximate membership checking |
| Full text |
Pdf
(387 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
table of contents
Vancouver, Canada
SESSION: Research Session 17: Probabilistic II
table of contents
Pages 805-818
Year of Publication: 2008
ISBN:978-1-60558-102-6
|
|
Authors
|
|
Kaushik Chakrabarti
|
Microsoft Research, Redmond, WA, USA
|
|
Surajit Chaudhuri
|
Microsoft Research, Redmond, WA, USA
|
|
Venkatesh Ganti
|
Microsoft Research, Redmond, WA, USA
|
|
Dong Xin
|
Microsoft Research, Redmond, WA, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 34, Downloads (12 Months): 301, Citation Count: 6
|
|
|
ABSTRACT
We consider the problem of identifying sub-strings of input text strings that approximately match with some member of a potentially large dictionary. This problem arises in several important applications such as extracting named entities from text documents and identifying biological concepts from biomedical literature. In this paper, we develop a filter-verification framework, and propose a novel in-memory filter structure. That is, we first quickly filter out sub-strings that cannot match with any dictionary member, and then verify the remaining sub-strings against the dictionary. Our method does not produce false negatives. We demonstrate the efficiency and effectiveness of our filter over real datasets, and show that it significantly outperforms the previous best-known methods in terms of both filtering power and computation time.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Amihood Amir , Dmitry Keselman , Gad M. Landau , Moshe Lewenstein , Noa Lewenstein , Michael Rodeh, Text indexing and dictionary matching with one error, Journal of Algorithms, v.37 n.2, p.309-325, Nov. 2000
[doi> 10.1006/jagm.2000.1104]
|
| |
3
|
|
 |
4
|
|
| |
5
|
|
 |
6
|
Kaushik Chakrabarti , Venkatesh Ganti , Jiawei Han , Dong Xin, Ranking objects based on relationships, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142516]
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
 |
10
|
|
| |
11
|
|
| |
12
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
| |
13
|
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In WebDB (Informal Proceedings), pages 129--134, 2000.
|
| |
14
|
|
| |
15
|
M. Narayanan and R. M. Karp. Gapped local similarity search with provable guarantees. In WABI, pages 74--86, 2004.
|
 |
16
|
|
| |
17
|
A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35--43, 2001.
|
| |
18
|
|
| |
19
|
A. C.-C. Yao and F. F. Yao. Dictionary loop-up with small errors. In CPM, pages 387--394, 1995.
|
| |
20
|
X. Zhou, X. Zhang, and X. Hu. Maxmatcher: Biological concept extraction using approximate dictionary lookup. In PRICAI, pages 1145--1149, 2006.
|
CITED BY 6
|
|
|
|
|
|
|
|
Wei Wang , Chuan Xiao , Xuemin Lin , Chengqi Zhang, Efficient approximate entity extraction with edit distance constraints, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|
|
Sanjay Agrawal , Kaushik Chakrabarti , Surajit Chaudhuri , Venkatesh Ganti , Arnd Christian Konig , Dong Xin, Exploiting web search engines to search structured databases, Proceedings of the 18th international conference on World wide web, April 20-24, 2009, Madrid, Spain
|
|