Improving web spam classification using rank-time features
AIRWeb; Vol. 215
Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Banff, Alberta, Canada
SESSION: Temporal and topological factors
Pages: 9 - 16
Year of Publication: 2007
ISBN: 978-1-59593-732-2
ABSTRACT
In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them a higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification, and we note that this issue arises in learning problems generally and can be hard to detect. In particular, we find that ensuring that no domains overlap between the test and training sets is necessary to test a web spam classifier accurately. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
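The domain-separation requirement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the 30% test fraction, and the example URLs are hypothetical, and the URL host is used as a simple stand-in for "domain" (a real pipeline would likely group by registered domain instead).

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host of a URL (assumed stand-in for 'domain')."""
    return urlsplit(url).netloc.lower()


def domain_separated_split(urls, test_fraction=0.3, seed=0):
    """Split URLs so that no domain appears in both training and test sets.

    The fraction applies to domains, not pages, so the page-level
    proportions can differ when domains have unequal page counts.
    """
    # Group pages by domain so each domain is assigned as a whole.
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[domain_of(url)].append(url)

    # Shuffle the domains reproducibly, then reserve a share for testing.
    domains = sorted(by_domain)
    random.Random(seed).shuffle(domains)
    n_test = max(1, round(len(domains) * test_fraction))
    test_domains = set(domains[:n_test])

    train = [u for d in domains if d not in test_domains for u in by_domain[d]]
    test = [u for d in domains if d in test_domains for u in by_domain[d]]
    return train, test


# Hypothetical example data: four domains, two pages each.
urls = [
    "http://a.example/p1", "http://a.example/p2",
    "http://b.example/p1", "http://b.example/p2",
    "http://c.example/p1", "http://c.example/p2",
    "http://d.example/p1", "http://d.example/p2",
]
train_urls, test_urls = domain_separated_split(urls)
```

A naive page-level shuffle would typically place pages from the same domain on both sides of the split, which is the kind of overlap the abstract warns can inflate measured precision.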