Web spam identification through language model analysis
Source: ACM International Conference Proceeding Series
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Madrid, Spain
SESSION: Content analysis
Pages 21-28  
Year of Publication: 2009
ISBN: 978-1-60558-438-6
Authors
Juan Martinez-Romo  UNED, Madrid, Spain
Lourdes Araujo  UNED, Madrid, Spain
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1531914.1531920

ABSTRACT

This paper applies a language model approach to different sources of information extracted from a Web page in order to provide high-quality indicators for the detection of Web spam. Two pages connected by a hyperlink should be topically related, even if only weakly. For this reason, we analyse different sources of information from a Web page that belong to the context of a link, and we apply Kullback-Leibler divergence to them to characterise the relationship between the two linked pages. Moreover, we combine some of these sources of information to obtain richer language models. Given the different nature of internal and external links, our study also distinguishes between these two types of link, which yields a significant improvement in classification tasks. The result is a system that improves the detection of Web spam on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
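
The core measurement described above, the Kullback-Leibler divergence between a language model built from a link's context and one built from the linked page, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the models are estimated, so the unigram models, whitespace tokenisation, and additive smoothing used here are assumptions, and the function names (unigram_model, kl_divergence, link_divergence) are hypothetical.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary.

    Assumption: whitespace tokenisation and add-alpha smoothing; the
    abstract does not specify either choice.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(P || Q) in bits for two distributions over the same vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text, alpha=1.0):
    """Divergence between a link-context model and a target-page model."""
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocab, alpha)
    q = unigram_model(target_text, vocab, alpha)
    return kl_divergence(p, q)


# Hypothetical example texts: an anchor text and two candidate target pages.
anchor = "cheap flights madrid"
related = link_divergence(anchor, "cheap flights to madrid and barcelona")
unrelated = link_divergence(anchor, "online casino poker bonus")
print(related < unrelated)
```

A low divergence suggests that the link context and the target page are topically related, while a high divergence suggests a mismatch that a classifier could use as one spam indicator. Smoothing keeps every probability non-zero, so the logarithm is always defined.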


