ABSTRACT
In this paper, we consider the problem of ambiguous author names in bibliographic citations, and comparatively study alternative approaches to identify and correct such name variants (e.g., "Vannevar Bush" and "V. Bush"). Our study is based on a scalable two-step framework: step 1 substantially reduces the number of candidates via blocking, and step 2 measures the distance between two names via coauthor information. Combining four blocking methods and seven distance measures on four data sets, we present extensive experimental results and identify combinations that are both scalable and effective at disambiguating author names in citations.
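The two-step framework can be illustrated with a minimal sketch. The abstract does not name the four blocking methods or the seven distance measures, so everything below is an assumption chosen for illustration: the blocking key (first initial plus last name), the use of Jaccard distance over coauthor sets, the 0.7 threshold, the function names, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Hypothetical blocking key: first initial plus last name token, lowercased."""
    tokens = name.replace(".", " ").split()
    return (tokens[0][0] + " " + tokens[-1]).lower()


def jaccard_distance(a, b):
    """Jaccard distance between two coauthor sets (1.0 means no overlap)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def candidate_pairs(coauthors_by_name):
    """Step 1: group names into blocks, then pair names only within a block."""
    blocks = defaultdict(list)
    for name in coauthors_by_name:
        blocks[blocking_key(name)].append(name)
    for names in blocks.values():
        yield from combinations(sorted(names), 2)


def likely_variants(coauthors_by_name, threshold=0.7):
    """Step 2: keep candidate pairs whose coauthor distance is below threshold."""
    result = []
    for a, b in candidate_pairs(coauthors_by_name):
        d = jaccard_distance(coauthors_by_name[a], coauthors_by_name[b])
        if d < threshold:
            result.append((a, b, d))
    return result


# Toy data, invented for illustration only (not from the paper's data sets).
toy = {
    "Vannevar Bush": {"A. Author", "B. Author", "C. Author"},
    "V. Bush": {"A. Author", "B. Author"},
    "V. Busch": {"A. Author"},
    "Vera Bush": {"X. Other"},
}

variants = likely_variants(toy)
```

With this toy input, blocking compares only names that share a key, so fewer pairs reach step 2 than an all-pairs comparison would produce. The sketch also shows one trade-off of a simple key: a misspelled surname such as "V. Busch" falls into a different block and is never compared, which is one reason comparing several blocking methods is worthwhile.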