ABSTRACT
We propose link-based techniques for the automatic detection of Web spam, a term referring to pages that use deceptive techniques to obtain undeservedly high scores in search engines. Web spam is widespread and difficult to combat, mostly because the sheer size of the Web makes many algorithms infeasible in practice. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page, applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers that consider only the link structure of the Web, regardless of page contents. We then study the performance of each classifier alone, as well as their combined performance, by testing them on a large collection of Web link spam. Under tenfold cross-validation, our best classifiers achieve performance comparable to that of state-of-the-art spam classifiers that use content attributes, while remaining orthogonal to content-based methods.