Entity categorization over large document collections
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 274-282
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Venkatesh Ganti, Microsoft Research, Redmond, WA, USA
Arnd C. König, Microsoft Research, Redmond, WA, USA
Rares Vernica, University of California, Irvine, Irvine, CA, USA
ABSTRACT
Extracting entities (such as people or movies) from documents and identifying the categories (such as painter or writer) they belong to enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches developed for this task analyze only the local document context within which entities occur. We significantly improve the accuracy of entity categorization by (i) considering an entity's context across multiple documents containing it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of entities has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.
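As a rough illustration of ideas (i) and (ii) in the abstract, the sketch below aggregates a bag of context words for each entity across all documents and counts how often members of related-entity lists appear in the same document as the entity. It is not the authors' algorithm: the function names, the single-token entity simplification, the window size, and the toy data are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def aggregate_context(documents, entities, window=2):
    """Merge context words around each entity mention across all documents.

    Assumption: entities are single lowercase tokens and documents are
    whitespace-tokenized strings.
    """
    contexts = defaultdict(Counter)
    for text in documents:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok in entities:
                lo, hi = max(0, i - window), i + window + 1
                # Words before and after the mention, excluding the mention itself.
                contexts[tok].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

def list_cooccurrence(documents, entities, related_lists):
    """Count, per entity, list members that share a document with it."""
    counts = defaultdict(Counter)
    for text in documents:
        tokens = set(text.lower().split())
        present = tokens & entities
        for list_name, members in related_lists.items():
            hits = len(tokens & members)
            if hits:
                for entity in present:
                    counts[entity][list_name] += hits
    return counts

# Toy data (hypothetical).
docs = [
    "picasso painted guernica in 1937",
    "the painter picasso lived in paris",
    "tolstoy wrote resurrection",
]
entities = {"picasso", "tolstoy"}
related_lists = {"paintings": {"guernica"}, "books": {"resurrection"}}

contexts = aggregate_context(docs, entities)
counts = list_cooccurrence(docs, entities, related_lists)
```

Here the context for "picasso" combines words from two separate documents, and the co-occurrence counts link "picasso" to the paintings list and "tolstoy" to the books list. Both feature sets could then feed a standard classifier; scanning every document against every list member, as this sketch does, is exactly the kind of cost that becomes a concern when the lists are very large.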