A grammar-based entity representation framework for data cleaning
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 6: entity resolution
Pages 233-244
Year of Publication: 2009
ISBN: 978-1-60558-551-2
Authors
Arvind Arasu  Microsoft Research, Redmond, WA, USA
Raghav Kaushik  Microsoft Research, Redmond, WA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1559845.1559871

ABSTRACT

Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework that can be used to reason about and manipulate data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. This framework has multiple applications such as parsing and data normalization. Data normalization is interesting in its own right in preparing data for analysis as well as in pre-processing data for further cleansing. We empirically study the utility of the framework over several real-world data cleaning scenarios and find that with the right normalization, often the need for further cleansing is minimized.
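
As a rough illustration of the idea, and not the paper's actual framework, the sketch below parses a street address with one hand-written grammar rule, resolves tokens against small lookup tables (standing in for database queries), and runs an action that emits a canonical form. The rule, the table contents, and the function name `normalize_address` are all assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical lookup tables; in a real system these would be database relations.
STREET_TYPES: Dict[str, str] = {
    "st": "Street", "street": "Street",
    "ave": "Avenue", "avenue": "Avenue",
}
DIRECTIONS: Dict[str, str] = {
    "n": "North", "north": "North",
    "s": "South", "south": "South",
}


def normalize_address(text: str) -> Optional[str]:
    """Match the illustrative rule

        Address -> Number Direction? NameWord+ StreetType

    and return a canonical string, or None if the rule does not match.
    """
    tokens: List[str] = text.replace(",", " ").split()
    if len(tokens) < 3 or not tokens[0].isdigit():
        return None
    number = tokens[0]

    # Optional Direction: resolved by table lookup, ignoring case and a trailing dot.
    pos = 1
    direction = DIRECTIONS.get(tokens[pos].lower().rstrip("."))
    if direction is not None:
        pos += 1

    # Required StreetType is the last token; NameWord+ is everything in between.
    street_type = STREET_TYPES.get(tokens[-1].lower().rstrip("."))
    name_words = tokens[pos:-1]
    if street_type is None or not name_words:
        return None

    # Action attached to the rule: assemble the normalized representation.
    parts = [number]
    if direction is not None:
        parts.append(direction)
    parts.extend(word.capitalize() for word in name_words)
    parts.append(street_type)
    return " ".join(parts)


print(normalize_address("123 N. Main St."))        # 123 North Main Street
print(normalize_address("123 north main street"))  # 123 North Main Street
```

Both inputs map to the same output, which is the sense in which normalization can reduce the need for later duplicate matching. A single greedy rule like this cannot handle ambiguous inputs such as "123 N St"; it simply rejects them.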


