ABSTRACT
Web spam can significantly deteriorate the quality of search engine results. Thus, there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper, we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph and assigning the same label to all hosts in a cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large, public dataset, using algorithms that can be applied in practice to large-scale Web data.
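As a rough illustration of methods (i) and (iii), the following Python sketch relabels hosts by cluster-level majority vote and derives a neighbor-based feature that could be appended before retraining. It is not the system described in the paper: the function names, the toy host graph, the cluster assignment, and the choice of "fraction of neighbors predicted as spam" as the new feature are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote_by_cluster(predicted, cluster_of):
    """Method (i), sketched: give every host the majority predicted label of its cluster."""
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    return {
        host: votes[cluster_of[host]].most_common(1)[0][0]
        for host in predicted
    }


def neighbor_spam_fraction(predicted, neighbors):
    """Method (iii), sketched: fraction of a host's neighbors predicted as spam.

    The resulting value would be added as an extra feature before retraining
    the base classifier (the retraining step itself is not shown here).
    """
    feature = {}
    for host in predicted:
        linked = neighbors.get(host, [])
        spam = sum(1 for n in linked if predicted.get(n) == "spam")
        feature[host] = spam / len(linked) if linked else 0.0
    return feature


# Hypothetical base-classifier output, clustering, and host graph.
predicted = {"a": "spam", "b": "spam", "c": "nonspam", "d": "nonspam", "e": "nonspam"}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": ["e"], "e": ["d"]}

smoothed = majority_vote_by_cluster(predicted, cluster_of)
feature = neighbor_spam_fraction(predicted, neighbors)
```

In this toy input, host `c` is predicted non-spam but sits in a cluster where spam predictions dominate, so the majority vote flips its label. The neighbor feature captures the same dependency in a softer form that a retrained classifier can weigh against the other features.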