Disambiguating authors in academic publications using random forests
Source
International Conference on Digital Libraries
Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries
Austin, TX, USA
SESSION: 2
Pages 39-48
Year of Publication: 2009
ISBN: 978-1-60558-322-8
Authors
Pucktada Treeratpituk, Pennsylvania State University, University Park, PA, USA
C. Lee Giles, Pennsylvania State University, University Park, PA, USA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1555400.1555408

ABSTRACT

Users of digital libraries usually want to know exactly who wrote an article. But different authors may share the same name, either as a full name or as initials plus last name (complete name changes are not considered here). In such cases, the user would like the digital library to distinguish among these authors. Name disambiguation helps in many situations, for example when a user searches for all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pairwise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms previously proposed techniques such as those using support vector machines (SVMs). In addition, we demonstrate that the variable importance produced by the random forest model can be used for feature selection with little degradation in disambiguation accuracy. In particular, two features alone, the inverse document frequency of the author's last name and the similarity of the middle names, achieve an accuracy of almost 90%.
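As a rough illustration of the pairwise setup described in the abstract, the sketch below computes a two-feature similarity profile for a pair of author records, using the two features the abstract singles out. The abstract does not define these features precisely, so the smoothed IDF formula, the four-level middle-name score, the record layout, and all function names here are assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter


def last_name_idf(last_name, all_last_names):
    """Smoothed inverse document frequency of a last name.

    Assumption: one entry in all_last_names per author mention in the
    collection. Rare last names get a higher score than common ones.
    """
    counts = Counter(name.lower() for name in all_last_names)
    total = len(all_last_names)
    return math.log((1 + total) / (1 + counts[last_name.lower()]))


def middle_name_similarity(a, b):
    """Hypothetical four-level middle-name score.

    3 = both full middle names given and identical
    2 = compatible initials (at least one side is only an initial)
    1 = at least one side has no middle name (no evidence either way)
    0 = the middle names conflict
    """
    a = a.replace(".", "").strip().lower()
    b = b.replace(".", "").strip().lower()
    if not a or not b:
        return 1
    if a == b and len(a) > 1:
        return 3
    if a[0] == b[0] and (len(a) == 1 or len(b) == 1):
        return 2
    return 0


def similarity_profile(rec_a, rec_b, all_last_names):
    """Build the feature dict for one pair of author records.

    Assumption: the pair already shares a last name, so the IDF is
    computed from rec_a's last name only.
    """
    return {
        "last_idf": last_name_idf(rec_a["last"], all_last_names),
        "mid_sim": middle_name_similarity(rec_a["middle"], rec_b["middle"]),
    }


# Toy collection of last names (made-up data for the example).
corpus_last_names = ["giles", "giles", "smith", "smith", "smith", "smith"]
pair_profile = similarity_profile(
    {"last": "Giles", "middle": "Lee"},
    {"last": "Giles", "middle": "L."},
    corpus_last_names,
)
print(pair_profile)
```

In a full pipeline, profiles like this would be computed for labeled pairs (same author or different authors) and passed to a classifier such as a random forest; that training step is omitted here.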


