ABSTRACT
We propose a number of features for Web spam filtering based on the occurrence of keywords that are either of high advertisement value or highly spammed. Our features include popular words from search engine query logs as well as high-cost or high-volume words according to Google AdWords. We also demonstrate the spam filtering power of three further signals: the Online Commercial Intention (OCI) value assigned to a URL in a Microsoft adCenter Labs demonstration, the Yahoo! Mindset classification of Web pages as either commercial or non-commercial, and metrics based on the occurrence of Google ads on the page. We run our tests on the WEBSPAM-UK2006 dataset, recently compiled by Castillo et al. as a standard means of measuring the performance of Web spam detection algorithms. Adding our features to the publicly available WEBSPAM-UK2006 features improves classification accuracy by 3%.
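As a rough illustration of what a keyword-occurrence feature of this kind might look like, the sketch below scores a page's text against a keyword list with per-keyword values (for example, a cost estimate or a query-log frequency). The tokenizer, the feature names, and the example keyword values are all assumptions made for illustration; they are not the features or data used in the paper.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_features(text: str, keyword_value: Dict[str, float]) -> Dict[str, float]:
    """Compute simple keyword-occurrence features for one page.

    keyword_value maps a keyword to a hypothetical value, such as an
    advertising cost estimate or a query-log popularity score.
    """
    tokens = tokenize(text)
    if not tokens:
        return {"hit_fraction": 0.0, "mean_value": 0.0, "max_value": 0.0}

    # Values of every token occurrence that matches a listed keyword.
    hits = [keyword_value[t] for t in tokens if t in keyword_value]

    return {
        # Share of tokens that are listed keywords.
        "hit_fraction": len(hits) / len(tokens),
        # Total keyword value averaged over all tokens on the page.
        "mean_value": sum(hits) / len(tokens),
        # Value of the most valuable keyword that occurs on the page.
        "max_value": max(hits, default=0.0),
    }


# Hypothetical keyword values, for illustration only.
example_values = {"loan": 4.0, "casino": 2.0}
print(keyword_features("Cheap loan, casino loan!", example_values))
```

Features like these would then be fed to a classifier alongside other page and link features; which classifier and which exact statistics work best is an empirical question that this sketch does not address.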
REFERENCES
[1] A. A. Benczúr, K. Csalogány, E. Friedman, D. Fogaras, T. Sarlós, M. Uher, and E. Windhager. Searching a small national domain---preliminary report. In Proc. WWW, 2003.
[2] A. A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In Proc. AIRWeb, 2006.
[3] A. A. Benczúr, K. Csalogány, T. Sarlós, and M. Uher. SpamRank -- Fully automatic link spam detection. In Proc. AIRWeb, 2005.
[4]
[5] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2):11--24, December 2006. doi:10.1145/1189702.1189703
[6] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. DELIS Technical Report TR-0458, 2006.
[7] K. Chellapilla and D. M. Chickering. Improving cloaking detection using search query popularity and monetizability. In Proc. AIRWeb, pages 17--24, 2006.
[8] H. Dai, L. Zhao, Z. Nie, J.-R. Wen, L. Wang, and Y. Li. Detecting online commercial intention (OCI). In Proc. 15th International Conference on World Wide Web (WWW), Edinburgh, Scotland, May 23--26, 2006. doi:10.1145/1135777.1135902
[9] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proc. ECML, volume 3720 of LNAI, pages 233--243, 2005.
[10]
[11] R. Fagin, R. Kumar, K. S. McCurley, J. Novak, D. Sivakumar, J. A. Tomlin, and D. P. Williamson. Searching the workplace web. In Proc. 12th International Conference on World Wide Web (WWW), Budapest, Hungary, May 20--24, 2003. doi:10.1145/775152.775204
[12] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proc. 7th International Workshop on the Web and Databases (WebDB), colocated with ACM SIGMOD/PODS 2004, Paris, France, June 17--18, 2004. doi:10.1145/1017074.1017077
[13]
[14] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. AIRWeb, 2005.
[15] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. VLDB, pages 576--587, 2004.
[16]
[17]
[18] Y.-M. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th International Conference on World Wide Web (WWW), Banff, Alberta, Canada, May 8--12, 2007. doi:10.1145/1242572.1242612
[19]
[20] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Workshop on Models of Trust for the Web, 2006.