Web spam identification through language model analysis
Source
ACM International Conference Proceeding Series
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Madrid, Spain
SESSION: Content analysis
Pages 21-28
Year of Publication: 2009
ISBN: 978-1-60558-438-6
ABSTRACT
This paper applies a language model approach to different sources of information extracted from a Web page, in order to provide high-quality indicators for the detection of Web spam. Two pages connected by a hyperlink should be topically related, even if only through a weak contextual relation. For this reason we analyse different sources of information in a Web page that belong to the context of a link, and we apply Kullback-Leibler divergence to them to characterise the relationship between two linked pages. Moreover, we combine some of these sources of information in order to obtain richer language models. Given the different nature of internal and external links, our study also distinguishes between these two types of links, which yields a significant improvement in classification tasks. The result is a system that improves the detection of Web spam on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
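To make the divergence idea in the abstract concrete, the following is a minimal sketch of how Kullback-Leibler divergence between two unigram language models could be computed, one built from text in a link's context and one from the target page. It is not the authors' implementation: the function names (`tokenize`, `unigram_model`, `kl_divergence`, `link_divergence`), the simple regex tokenizer, and the additive smoothing with parameter `alpha` are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_model(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary.

    The smoothing scheme is an assumption; it only guarantees that every
    term has non-zero probability so the divergence stays finite.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {term: (counts[term] + alpha) / total for term in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits for two distributions over the same terms."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def link_divergence(source_text, target_text, alpha=1.0):
    """Divergence between the link-context model and the target-page model."""
    src, tgt = tokenize(source_text), tokenize(target_text)
    vocab = set(src) | set(tgt)
    if not vocab:
        return 0.0
    return kl_divergence(unigram_model(src, vocab, alpha),
                         unigram_model(tgt, vocab, alpha))


related = link_divergence("cheap flights madrid",
                          "book cheap flights to madrid today")
unrelated = link_divergence("cheap flights madrid",
                            "online casino poker bonus")
print(related < unrelated)
```

Under these assumptions, a link whose context shares vocabulary with its target yields a lower divergence than one pointing to unrelated text. In a classifier, such a value would be one feature among several, computed per source of information and, as the abstract suggests, separately for internal and external links.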