Cleaning search results using term distance features
Source: AIRWeb; Vol. 295
Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Beijing, China
SESSION: Text analysis
Pages 21-24
Year of Publication: 2008
ISBN: 978-1-60558-159-0
Authors
Josh Attenberg, Polytechnic University, Brooklyn, NY
Torsten Suel, Polytechnic University, Brooklyn, NY
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1451983.1451989

ABSTRACT

The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated with the number of views that page receives. This paper describes a term-based technique for spam detection based on a simple new summary data structure called Term Distance Histograms that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many Web pages generated by techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.
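
The abstract names Term Distance Histograms but does not define them. As a rough illustration only, the sketch below assumes one plausible reading: for a chosen pair of terms, count how often their occurrences lie d token positions apart. The function name `term_distance_histogram`, the `max_distance` cutoff, and the whitespace tokenization are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def term_distance_histogram(tokens, term_a, term_b, max_distance=10):
    """Hypothetical sketch, not the paper's definition.

    Returns a list h where h[d - 1] is the number of (term_a, term_b)
    occurrence pairs that are exactly d token positions apart,
    for 1 <= d <= max_distance.
    """
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]

    counts = Counter()
    for i in positions_a:
        for j in positions_b:
            distance = abs(i - j)
            if 1 <= distance <= max_distance:
                counts[distance] += 1

    # Dense list so histograms for different pages are directly comparable.
    return [counts[d] for d in range(1, max_distance + 1)]


# Assumed tokenization: lowercase text split on whitespace.
page = "cheap flights to paris cheap hotels in paris".split()
histogram = term_distance_histogram(page, "cheap", "paris", max_distance=8)
```

Under this assumed definition, a page whose text was stitched together from unrelated sources might show histograms for topically related term pairs that differ from those of naturally written text; whether and how the paper uses such a signal is not stated in the abstract.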


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

[2] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of Web spam. In Workshop on Advers. Inf. Retrieval on the Web, Aug. 2006.

[4] A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank - fully automatic link spam detection. In Workshop on Advers. Inf. Retrieval on the Web, 2005.

[5] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, Yahoo! Research Barcelona, Nov. 2006.

[6] B. Davison. Recognizing nepotistic links on the web. In Workshop on Artificial Intelligence for Web Search, 2000.

[7] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proc. European Conf. on Machine Learning, 2005.

[11] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In Workshop on Advers. Inf. Retrieval on the Web, 2005.

[16] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proc. of the 1st Int. Workshop on Adversarial Information Retrieval on the Web, pages 1--6, 2005.

[20] B. Wu, V. Goel, and B. Davison. Propagating trust and distrust to demote Web spam. In Workshop on Models of Trust and the Web, 2006.

