ABSTRACT
The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to limit the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated with the number of views that page receives. This paper describes a term-based technique for spam detection built on a simple new summary data structure, the Term Distance Histogram, which tries to capture the topical structure of a page. We apply this technique as a post-filtering step to the results of a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the engine's results. Specifically, our method detects many Web pages generated with techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.
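The abstract names the data structure but does not define it, so the following is only an illustrative sketch under one assumed reading: for each term on a page, count the token gaps between consecutive occurrences of that term and group the gaps into coarse buckets. The function name `term_distance_histograms`, the bucket boundaries in `BUCKET_EDGES`, and the whitespace tokenization are hypothetical choices made for this example, not details taken from the paper.

```python
from bisect import bisect_left
from collections import defaultdict

# Assumed bucket upper bounds (in tokens); a final overflow bucket is added.
BUCKET_EDGES = (1, 2, 4, 8, 16, 32, 64)


def term_distance_histograms(tokens, edges=BUCKET_EDGES):
    """Hypothetical sketch: per-term histogram of gaps between repeats.

    Bucket i counts gaps g with edges[i-1] < g <= edges[i]; the last
    bucket collects every gap larger than edges[-1].
    """
    last_seen = {}  # term -> position of its most recent occurrence
    histograms = defaultdict(lambda: [0] * (len(edges) + 1))
    for position, term in enumerate(tokens):
        if term in last_seen:
            gap = position - last_seen[term]
            histograms[term][bisect_left(edges, gap)] += 1
        last_seen[term] = position
    return dict(histograms)


if __name__ == "__main__":
    page = "cheap pills buy cheap pills now cheap".split()
    for term, hist in term_distance_histograms(page).items():
        print(term, hist)
```

Under this reading, a page stitched together from unrelated phrases might show gap distributions that differ from those of text written around a single topic, even when its word frequencies look natural. Whether that matches the features the authors actually use is an assumption of this sketch.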
REFERENCES
[1] Einat Amitay, David Carmel, Adam Darlow, Ronny Lempel, and Aya Soffer. The connectivity sonar: detecting site functionality by structural patterns. In Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, August 26-30, 2003, Nottingham, UK. doi:10.1145/900051.900060
[2] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of Web Spam. In Workshop on Advers. Inf. Retrieval on the Web, Aug. 2006.
[3] András Benczúr, István Bíró, Károly Csalogány, and Tamás Sarlós. Web spam detection via commercial intent analysis. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 8, 2007, Banff, Alberta, Canada. doi:10.1145/1244408.1244424
[4] A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank - fully automatic link spam detection. In Workshop on Advers. Inf. Retrieval on the Web, 2005.
[5] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, Yahoo! Research Barcelona, Nov. 2006.
[6] B. Davison. Recognizing nepotistic links on the web. In Workshop on Artificial Intelligence for Web Search, 2000.
[7] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proc. European Conf. on Machine Learning, 2005.
[11] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In Workshop on Advers. Inf. Retrieval on the Web, 2005.
[16] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proc. of the 1st Int. Workshop on Adversarial Information Retrieval on the Web, pages 1-6, 2005.
[20] B. Wu, V. Goel, and B. Davison. Propagating trust and distrust to demote Web spam. In Workshop on Models of Trust and the Web, 2006.