Information extraction from Wikipedia: moving down the long tail
Source
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 731-739
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Fei Wu, University of Washington, Seattle, WA, USA
Raphael Hoffmann, University of Washington, Seattle, WA, USA
Daniel S. Weld, University of Washington, Seattle, WA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401978

ABSTRACT

Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision.
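
The first technique, shrinkage over a subsumption taxonomy, can be illustrated with a small sketch. The idea shown here is that a sparse class borrows training examples from related classes, with nearer classes counting for more. Everything concrete below is an assumption made for illustration and does not come from the paper: the toy taxonomy `PARENT`, the example identifiers in `EXAMPLES`, the function names, and the 1 / (1 + distance) weighting.

```python
from collections import deque

# Hypothetical toy taxonomy: child class -> parent class (assumed, not from the paper).
PARENT = {
    "performer": "person",
    "actor": "performer",
    "comedian": "performer",
}

# Hypothetical training-example identifiers per class for a single attribute.
EXAMPLES = {
    "person": ["p1", "p2", "p3"],
    "performer": ["f1"],
    "actor": ["a1", "a2"],
    "comedian": [],  # a sparse "long tail" class with no examples of its own
}


def neighbors(cls):
    """Return the children of cls, followed by its parent if it has one."""
    result = [child for child, parent in PARENT.items() if parent == cls]
    if cls in PARENT:
        result.append(PARENT[cls])
    return result


def shrinkage_training_set(target, max_distance=2):
    """Collect (example, weight) pairs for target, borrowing from related classes.

    The weight is 1 / (1 + d), where d is the path distance in the taxonomy.
    This weighting scheme is an illustrative assumption.
    """
    seen = {target}
    queue = deque([(target, 0)])
    weighted = []
    while queue:
        cls, dist = queue.popleft()
        weight = 1.0 / (1 + dist)
        weighted.extend((ex, weight) for ex in EXAMPLES.get(cls, []))
        if dist < max_distance:
            for nxt in neighbors(cls):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return weighted


if __name__ == "__main__":
    for example, weight in shrinkage_training_set("comedian"):
        print(f"{example}: weight {weight:.2f}")
```

In this toy setting the empty `comedian` class ends up with six weighted examples drawn from `performer`, `actor`, and `person`, which is the kind of training-data augmentation that could raise recall for sparse classes.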

