Entity categorization over large document collections
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 274-282
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Venkatesh Ganti  Microsoft Research, Redmond, WA, USA
Arnd C. König  Microsoft Research, Redmond, WA, USA
Rares Vernica  University of California, Irvine, Irvine, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401927

ABSTRACT

Extracting entities (such as people, movies) from documents and identifying the categories (such as painter, writer) they belong to enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches developed for this task analyzed only the local document context within which entities occur. We significantly improve the accuracy of entity categorization by (i) considering an entity's context across multiple documents containing it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of entities has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.
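
The two ideas named in the abstract, aggregating an entity's context across documents and using lists of related entities, can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not describe in detail. The function names, the whitespace tokenizer, the fixed word window, and the restriction to single-token lowercase entity names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def aggregate_context(documents, entities, window=3):
    """Sum, per entity, the counts of words seen within `window` tokens
    of any mention, across ALL documents (hypothetical helper).

    Assumes entities are single lowercase tokens and that splitting on
    whitespace is an acceptable tokenizer.
    """
    context = defaultdict(Counter)
    for text in documents:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token not in entities:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[token][tokens[j]] += 1
    return context


def list_cooccurrence(documents, entity, related_list):
    """Count documents that mention `entity` together with at least one
    member of `related_list` (hypothetical helper, same assumptions)."""
    count = 0
    for text in documents:
        tokens = set(text.lower().split())
        if entity in tokens and tokens & related_list:
            count += 1
    return count
```

The aggregated word counts and the co-occurrence count could then serve as features for a classifier that assigns a category such as painter or writer. Both helpers scan every document, which hints at why aggregation over a large collection and matching against very large lists raise the scalability concerns the abstract mentions.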



