ABSTRACT
Identity resolution aims to identify newly presented facts and link them to their previous mentions. Our main hypothesis is that variations of one and the same fact can be recognised and duplicates removed, and that aggregating them increases the correctness of fact extraction. Our approach to the identity problem has been implemented as the Identity Resolution Framework (IdRF). The framework provides a general solution for identifying known and new facts in specific domains, and it can be used in different applications to process different types of entity. It uses an ontology as the knowledge representation formalism for both internal and resulting knowledge. The ontology contains not only the representation of the domain, but also known entities and properties. Apart from extracting information from textual sources, we also exploit structured information available in databases by mapping the database schema to the ontology and populating the ontology with existing knowledge. Our main goal is not to advocate one criterion over the others, but to introduce a widely applicable solution to the identity resolution problem: we present a set of customisable criteria as well as a mechanism for adding new criteria. We have carried out two series of experiments in two different business intelligence domains, company profiling and recruitment, achieving encouraging results.