ABSTRACT
The old dream of a universal repository containing all human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct, collaborative participation of people. Wikipedia is a great example. It is an enormous repository of information with free access and editing, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its relative quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into a single assessment judgment. Through experiments, we show that the most important quality indicators are the easiest ones to extract, namely, textual features related to length, structure and style. We were also able to determine which indicators did not contribute significantly to the quality assessment. These were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with a state-of-the-art solution and show significant improvements in terms of effective quality prediction.