Information extraction from Wikipedia: moving down the long tail
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 731-739
Year of Publication: 2008
ISBN:978-1-60558-193-4
ABSTRACT
Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision.
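To make the first technique concrete, here is a minimal sketch of shrinkage in its generic statistical sense: an estimate for a sparse class is pulled toward the estimate of a better-populated parent class in the taxonomy. This is an illustration under stated assumptions, not the procedure used in the paper. The `ClassStats` structure, the `shrunk_estimate` function, the smoothing constant `k`, the class names, and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    """Counts for one article class (hypothetical structure for illustration)."""
    name: str
    positives: int  # training sentences in which some feature fires
    total: int      # training sentences available for the class


def shrunk_estimate(child: ClassStats, parent: ClassStats, k: float = 10.0) -> float:
    """Blend the child's raw rate with its parent's rate.

    The weight on the child grows with its sample size. `k` is an assumed
    smoothing constant, not a value taken from the paper.
    """
    if parent.total == 0:
        raise ValueError("parent class needs at least one example")
    parent_rate = parent.positives / parent.total
    if child.total == 0:
        # No data at all for the sparse class: fall back to the parent.
        return parent_rate
    child_rate = child.positives / child.total
    lam = child.total / (child.total + k)
    return lam * child_rate + (1.0 - lam) * parent_rate


# Made-up counts: a sparse class and a well-populated parent class.
performer = ClassStats("performer", positives=1, total=2)
person = ClassStats("person", positives=300, total=1000)
estimate = shrunk_estimate(performer, person)
```

With these toy counts the raw rate for the sparse class is 0.5, based on only two examples, and the parent rate is 0.3. The shrunk estimate lands at about 0.33, close to the parent, because the sparse class contributes little evidence of its own.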
CITED BY 5
Raphael Hoffmann, Saleema Amershi, Kayur Patel, Fei Wu, James Fogarty, Daniel S. Weld, Amplifying community content creation with mixed initiative information extraction, Proceedings of the 27th international conference on Human factors in computing systems, April 04-09, 2009, Boston, MA, USA
Daniel S. Weld, Fei Wu, Eytan Adar, Saleema Amershi, James Fogarty, Raphael Hoffmann, Kayur Patel, Michael Skinner, Intelligence in wikipedia, Proceedings of the 23rd national conference on Artificial intelligence, p.1609-1614, July 13-17, 2008, Chicago, Illinois