Establishing value mappings using statistical models and user feedback
Source
Conference on Information and Knowledge Management
Proceedings of the 14th ACM international conference on Information and knowledge management
Bremen, Germany
SESSION: Paper session KM-1 (knowledge management): knowledge systems
Pages: 68 - 75
Year of Publication: 2005
ISBN: 1-59593-140-6
Authors
Jaewoo Kang, North Carolina State University, Raleigh, NC
Tae Sik Han, North Carolina State University, Raleigh, NC
Dongwon Lee, Pennsylvania State University, University Park, PA
Prasenjit Mitra, Pennsylvania State University, University Park, PA
ABSTRACT
In this paper, we present a "value mapping" algorithm that does not rely on syntactic similarity or semantic interpretation of the values. The algorithm first constructs a statistical model (e.g., co-occurrence frequency or entropy vector) that captures the unique characteristics of values and their co-occurrence. It then finds the matching values by computing the distances between the models while refining the models using user feedback through iterations. Our experimental results suggest that our approach successfully establishes value mappings even in the presence of opaque data values and thus can be a useful addition to the existing data integration techniques.
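The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the two-column row layout, the function names `value_signatures` and `match_values`, the choice of (relative frequency, co-occurrence entropy) as the model, the greedy nearest-distance matching, and the treatment of user feedback as fixed pairs are all assumptions made for illustration. The point is only that such a model depends on how values are distributed, not on how they are spelled, so opaque codes can still be compared.

```python
import math
from collections import Counter, defaultdict


def value_signatures(rows):
    """For each value in column 0, return (relative frequency,
    entropy of the column-1 values it co-occurs with)."""
    total = len(rows)
    freq = Counter(a for a, _ in rows)
    cooc = defaultdict(Counter)
    for a, b in rows:
        cooc[a][b] += 1
    sigs = {}
    for a, n in freq.items():
        entropy = -sum((c / n) * math.log2(c / n) for c in cooc[a].values())
        sigs[a] = (n / total, entropy)
    return sigs


def match_values(rows_s, rows_t, confirmed=None):
    """Greedy one-to-one matching of source values to target values by
    Euclidean distance between signatures. `confirmed` holds pairs the
    user has already accepted (hypothetical feedback format); they are
    fixed before the remaining values are matched."""
    sig_s, sig_t = value_signatures(rows_s), value_signatures(rows_t)
    mapping = dict(confirmed or {})
    used_t = set(mapping.values())
    candidates = sorted(
        (math.dist(sig_s[s], sig_t[t]), s, t)
        for s in sig_s if s not in mapping
        for t in sig_t if t not in used_t
    )
    for _, s, t in candidates:
        if s not in mapping and t not in used_t:
            mapping[s] = t
            used_t.add(t)
    return mapping


# Toy data: the same distribution encoded with unrelated, opaque codes.
source = [("A", "p")] * 4 + [("B", "p"), ("B", "q")] + [("C", "q")] * 2
target = [("x", "u"), ("x", "v")] + [("y", "u")] * 4 + [("z", "v")] * 2

mapping = match_values(source, target)
print(mapping)

# One round of (simulated) feedback: the user fixes A -> z,
# and the remaining values are re-matched around that pair.
revised = match_values(source, target, confirmed={"A": "z"})
print(revised)
```

In this sketch, feedback only removes confirmed values from the candidate pool. The abstract describes something stronger, namely refining the models themselves over several iterations, which this example does not attempt.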