ABSTRACT
An author may have multiple names, and multiple authors may share the same name, because of name abbreviations, identical names, or name misspellings in publications or bibliographies. The resulting name ambiguity can degrade the performance of document retrieval, web search, and database integration, and may cause improper attribution of credit. Proposed here is an unsupervised learning approach that uses K-way spectral clustering to disambiguate authors in citations. The approach uses three types of citation attributes: co-author names, paper titles, and publication venue titles. The approach is illustrated on 16 name datasets, with citations collected from the DBLP bibliography database and from author home pages, and the results show that name disambiguation can be achieved using these citation attributes.
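The idea of clustering citations by their attributes can be illustrated with a minimal sketch. It assumes each citation is reduced to a bag of tokens drawn from the three attributes named above, and that K (the number of distinct authors) is known. The function names, the toy citations, and the small k-means step applied to the eigenvectors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def citation_features(citations):
    """Turn each citation into a unit-length binary bag-of-tokens vector."""
    vocab, rows = {}, []
    for c in citations:
        tokens = set()
        for field in ("coauthors", "title", "venue"):
            for item in c[field]:
                tokens.add(field + ":" + item.lower())
        rows.append(tokens)
        for t in sorted(tokens):
            vocab.setdefault(t, len(vocab))
    X = np.zeros((len(citations), len(vocab)))
    for i, tokens in enumerate(rows):
        for t in tokens:
            X[i, vocab[t]] = 1.0
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def spectral_clusters(X, k, iters=20):
    """Embed citations with the top-k eigenvectors of the Gram matrix,
    then group the embedded rows with a small deterministic k-means."""
    gram = X @ X.T                      # cosine similarity between citations
    _, vecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    emb = vecs[:, -k:]                  # keep the k leading eigenvectors
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.where(norms == 0, 1.0, norms)

    # Farthest-point initialisation keeps the result reproducible.
    centers = [emb[0]]
    for _ in range(1, k):
        d = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[int(np.argmax(d))])
    centers = np.array(centers)

    labels = np.zeros(len(emb), dtype=int)
    for _ in range(iters):
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = emb[labels == j].mean(axis=0)
    return labels

# Hypothetical citations that all carry the same ambiguous author name.
citations = [
    {"coauthors": ["A. Gupta"], "title": ["wireless", "networks"], "venue": ["infocom"]},
    {"coauthors": ["A. Gupta", "B. Lee"], "title": ["sensor", "networks"], "venue": ["infocom"]},
    {"coauthors": ["C. Wang"], "title": ["protein", "folding"], "venue": ["bioinformatics"]},
    {"coauthors": ["C. Wang"], "title": ["protein", "structure"], "venue": ["bioinformatics"]},
]
labels = spectral_clusters(citation_features(citations), k=2)
print(labels)
```

In this toy input the first two citations share a co-author and a venue, as do the last two, so each pair should land in its own cluster. Real citation data would need token weighting and a way to choose K, neither of which is covered here.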