Improving web spam classification using rank-time features
Source: AIRWeb; Vol. 215
Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web
Banff, Alberta, Canada
SESSION: Temporal and topological factors
Pages: 9-16
Year of Publication: 2007
ISBN: 978-1-59593-732-2
Authors
Krysta M. Svore  Microsoft Research, Redmond, WA
Qiang Wu  Microsoft Research, Redmond, WA
Chris J. C. Burges  Microsoft Research, Redmond, WA
Aaswath Raman  Microsoft, Redmond, WA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1244408.1244411

ABSTRACT

In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them a higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification, and we note that this problem arises in learning problems generally and can be hard to detect. In particular, we find that ensuring that the test and training sets share no domains is necessary to test a web spam classifier accurately. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
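
The domain-separation requirement described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' procedure: the abstract does not say how domains were identified or how the data were partitioned. The function names `domain_of` and `domain_separated_split`, the use of the URL host as the "domain", and the 20% test fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host of a URL.

    Assumption: the host stands in for the 'domain'. A real system might
    instead group by registered domain (e.g. merging sub.example.com and
    example.com).
    """
    return urlsplit(url).hostname or ""


def domain_separated_split(urls, test_fraction=0.2, seed=0):
    """Split URLs into train/test sets that share no domain.

    Whole domains, not individual pages, are assigned to one side, so pages
    from the same site can never appear in both sets.
    """
    groups = defaultdict(list)
    for url in urls:
        groups[domain_of(url)].append(url)

    # Shuffle the domains deterministically, then carve off the test domains.
    domains = sorted(groups)
    random.Random(seed).shuffle(domains)
    n_test = max(1, round(test_fraction * len(domains)))
    test_domains = set(domains[:n_test])

    train = [u for d in domains if d not in test_domains for u in groups[d]]
    test = [u for d in domains if d in test_domains for u in groups[d]]
    return train, test


if __name__ == "__main__":
    # Hypothetical example URLs, not data from the paper.
    pages = [
        "http://a.example/1", "http://a.example/2",
        "http://b.example/1", "http://c.example/1",
        "http://c.example/2", "http://d.example/1",
    ]
    train, test = domain_separated_split(pages)
    overlap = {domain_of(u) for u in train} & {domain_of(u) for u in test}
    print(len(train), len(test), overlap)
```

A page-level random split of the same data could place two pages from one site on opposite sides of the split, which is the kind of overlap the abstract reports as inflating measured precision. Note that with a single domain this sketch puts everything in the test set, so a real pipeline would need to handle that case.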

