Disambiguating authors in academic publications using random forests
Source
International Conference on Digital Libraries
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Austin, TX, USA
Pages 39-48
Year of Publication: 2009
ISBN: 978-1-60558-322-8
ABSTRACT
Users of digital libraries usually want to know the exact author or authors of an article. However, different authors may share the same name, either as a full name or as initials and a last name (complete name changes are not considered here). In such cases, the user would like the digital library to differentiate among these authors. Name disambiguation helps in many situations, for example when a user searches for all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pairwise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms previously proposed techniques such as those using support vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used for feature selection with little degradation in disambiguation accuracy. In particular, the inverse document frequency of the author's last name and the similarity of the middle names alone achieve an accuracy of almost 90%.
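To make the two features singled out at the end of the abstract more concrete, the following is a minimal Python sketch of how a pairwise similarity profile might be assembled from them. It is an illustration under stated assumptions, not the paper's feature definitions: the record layout, the function names (`last_name_idf`, `middle_name_similarity`, `pair_profile`), the logarithmic IDF formula, and the middle-name scoring levels are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def last_name_idf(records):
    """Return {last_name: log(N / df)} over a list of records.

    Each record is a list of author dicts; df counts the records in which
    a last name appears at least once. (Assumed formulation.)
    """
    n = len(records)
    doc_freq = Counter()
    for authors in records:
        for last in {a["last"].lower() for a in authors}:
            doc_freq[last] += 1
    return {name: math.log(n / df) for name, df in doc_freq.items()}


def _norm(name):
    # Drop periods and case so that "A." and "a" compare equal.
    return name.replace(".", "").strip().lower()


def middle_name_similarity(a, b):
    """Hypothetical scoring: 1.0 identical, 0.75 initial agrees with a
    full name, 0.5 when either side is missing, 0.0 on conflict."""
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return 0.5
    if a == b:
        return 1.0
    if (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return 0.75
    return 0.0


def pair_profile(x, y, idf):
    """Two-feature profile for a pair of author mentions that are assumed
    to share a last name already."""
    return [
        idf.get(x["last"].lower(), 0.0),
        middle_name_similarity(x["middle"], y["middle"]),
    ]


# Toy data, invented for illustration only.
records = [
    [{"last": "Smith", "middle": "A."}],
    [{"last": "Smith", "middle": "Andrew"}],
    [{"last": "Smith", "middle": "B"}],
    [{"last": "Okonkwo", "middle": ""}],
]
idf = last_name_idf(records)
profile = pair_profile(records[0][0], records[1][0], idf)
```

A common last name such as "Smith" in this toy set receives a low IDF, so a match on it is weak evidence, while a rarer name scores higher. Profiles like `profile`, computed for labeled pairs, could then serve as input rows to a random forest classifier; that training step is not shown here.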