ABSTRACT
The rapid globalization of Wikipedia is generating a parallel, multilingual corpus of unprecedented scale. Pages for the same topic in many different languages emerge as a result of both manual translation and independent development. Unfortunately, these pages may appear at different times and vary in size, scope, and quality. Furthermore, differential growth rates cause the conceptual mapping between articles in different languages to be both complex and dynamic. These disparities provide the opportunity for a powerful form of information arbitrage: leveraging articles in one or more languages to improve the content in another. Analyzing four large language domains (English, Spanish, French, and German), we present Ziggurat, an automated system for aligning Wikipedia infoboxes, creating new infoboxes as necessary, filling in missing information, and detecting discrepancies between parallel pages. Our method uses self-supervised learning, and our experiments demonstrate its feasibility even in the absence of dictionaries.