|
ABSTRACT
In this paper, we consider two important problems that commonly occur in bibliographic digital libraries, which seriously degrade their data qualities: Mixed Citation (MC) problem (i.e., citations of different scholars with their names being homonyms are mixed together) and Split Citation (SC) problem (i.e., citations of the same author appear under different name variants). In particular, we investigate an effective yet scalable solution since citations in such digital libraries tend to be large-scale. After formally defining the problems and accompanying challenges, we present an effective solution that is based on the state-of-the-art sampling-based approximate join algorithm. Our claim is verified through preliminary experimental results.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. "Eliminating Fuzzy Duplicates in Data Warehouses". In VLDB, 2002.
|
 |
2
|
|
| |
3
|
M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. "Adaptive Name-Matching in Information Integration". IEEE Intelligent System, 18(5):16--23, 2003.
|
 |
4
|
Vinayak Borkar , Kaustubh Deshmukh , Sunita Sarawagi, Automatic segmentation of text into structured records, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.175-186, May 21-24, 2001, Santa Barbara, California, United States
|
 |
5
|
|
| |
6
|
W. Cohen, P. Ravikumar, and S. Fienberg. "A Comparison of String Distance Metrics for Name-matching tasks". In IIWeb Workshop held in conjunction with IJCAI, 2003.
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
I. P. Fellegi and A. B. Sunter. "A Theory for Record Linkage". J. of the American Statistical Society, 64:1183--1210, 1969.
|
 |
11
|
|
 |
12
|
Hui Han , Lee Giles , Hongyuan Zha , Cheng Li , Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries, June 07-11, 2004, Tuscon, AZ, USA
[doi> 10.1145/996350.996419]
|
 |
13
|
|
| |
14
|
Y. Hong, B.-W. On, and D. Lee. "System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach". In ECDL, 2004.
|
| |
15
|
|
| |
16
|
M. A. Jaro. "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida". J. of the American Statistical Association, 84(406), 1989.
|
| |
17
|
R. P. Kelley. "Blocking Considerations for Record Linkage Under Conditions of Uncertainty". In Proc. of Social Statistics Section, pages 602--605, 1984.
|
| |
18
|
|
 |
19
|
Andrew McCallum , Kamal Nigam , Lyle H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347123]
|
 |
20
|
Byung-Won On , Dongwon Lee , Jaewoo Kang , Prasenjit Mitra, Comparative study of name disambiguation problem using a scalable blocking-based framework, Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, June 07-11, 2005, Denver, CO, USA
[doi> 10.1145/1065385.1065463]
|
| |
21
|
H. Pasula et al. "Identity Uncertainty and Citation Matching". In Advances in Neural Information Processing Systems. MIT Press, 2003.
|
 |
22
|
|
 |
23
|
|
 |
24
|
|
| |
25
|
W. E. Winkler and Y. Thibaudeau. "An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census". Technical report, US Bureau of the Census, 1991.
|
CITED BY 8
|
|
|
|
|
Moisés G. de Carvalho , Marcos André Gonçalves , Alberto H. F. Laender , Altigran S. da Silva, Learning to deduplicate, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
|
|
|
|
|
|
Yang Song , Jian Huang , Isaac G. Councill , Jia Li , C. Lee Giles, Efficient topic-based unsupervised name disambiguation, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
|
|
|
|
|
|
In-Su Kang , Seung-Hoon Na , Seungwoo Lee , Hanmin Jung , Pyung Kim , Won-Kyung Sung , Jong-Hyeok Lee, On co-authorship for author disambiguation, Information Processing and Management: an International Journal, v.45 n.1, p.84-97, January, 2009
|
|
|
Alberto H.F. Laender , Marcos André Gonçalves , Ricardo G. Cota , Anderson A. Ferreira , Rodrygo L.T. Santos , Allan J.C. Silva, Keeping a digital library clean: new solutions to old problems, Proceeding of the eighth ACM symposium on Document engineering, September 16-19, 2008, Sao Paulo, Brazil
|
|
|
Denilson Alves Pereira , Berthier Ribeiro-Neto , Nivio Ziviani , Alberto H.F. Laender , Marcos André Gonçalves , Anderson A. Ferreira, Using web information for author name disambiguation, Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, June 15-19, 2009, Austin, TX, USA
|
|