A grammar-based entity representation framework for data cleaning
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 6: entity resolution
Pages 233-244
Year of Publication: 2009
ISBN: 978-1-60558-551-2
ABSTRACT
Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework that can be used to reason about and manipulate data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. This framework has multiple applications such as parsing and data normalization. Data normalization is interesting in its own right in preparing data for analysis as well as in pre-processing data for further cleansing. We empirically study the utility of the framework over several real-world data cleaning scenarios and find that with the right normalization, often the need for further cleansing is minimized.