Predicting web spam with HTTP session information
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: KM: information filtering
Pages 339-348  
Year of Publication: 2008
ISBN: 978-1-59593-991-3
Authors
Steve Webb  Georgia Institute of Technology, Atlanta, GA, USA
James Caverlee  Texas A&M University, College Station, TX, USA
Calton Pu  Georgia Institute of Technology, Atlanta, GA, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458129

ABSTRACT

Web spam is a widely recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expose users to dangerous Web-borne malware. To defend against Web spam, most previous research analyzes the contents of Web pages and the link structure of the Web graph. Unfortunately, these heavyweight approaches require full downloads of both legitimate and spam pages to be effective, making real-time deployment of these techniques infeasible for Web browsers, high-performance Web crawlers, and real-time Web applications. In this paper, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information (i.e., hosting IP addresses and HTTP session headers). Concretely, we built an HTTP session classifier based on our predictive technique, and by incorporating this classifier into HTTP retrieval operations, we are able to detect Web spam pages before the actual content transfer. As a result, our approach protects Web users from Web-propagated malware, and it generates significant bandwidth and storage savings. By applying our predictive technique to a corpus of almost 350,000 Web spam instances and almost 400,000 legitimate instances, we were able to successfully detect 88.2% of the Web spam pages with a false positive rate of only 0.4%. These classification results are superior to previous evaluation results obtained with traditional link-based and content-based techniques. Additionally, our experiments show that our approach saves an average of 15.4 KB of bandwidth and storage resources for every successfully identified Web spam page, while only adding an average of 101 microseconds to each HTTP retrieval operation. Therefore, our predictive technique can be successfully deployed in applications that demand real-time spam detection.
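
The abstract does not say which features or learning algorithm the authors used, so the following is only a minimal sketch of the general idea: derive features from the hosting IP address and the HTTP response headers, score them, and skip the body transfer when the score suggests spam. The feature names, the weights in TOY_WEIGHTS, the threshold, and the example addresses (taken from the reserved documentation ranges) are all hypothetical and stand in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class HttpSession:
    """Information available before the response body is transferred."""
    host_ip: str
    headers: Dict[str, str]


def extract_features(session: HttpSession) -> Set[str]:
    """Build string features from the hosting IP and the header fields."""
    features = {"ip=" + session.host_ip}
    # The /24 prefix lets a model generalize across neighbouring hosts.
    features.add("ip24=" + ".".join(session.host_ip.split(".")[:3]))
    for name, value in session.headers.items():
        key = name.strip().lower()
        features.add("hdr=" + key)                       # header is present
        features.add("hdr=" + key + ":" + value.strip().lower())
    return features


# Hypothetical weights; a real system would learn these from labelled data.
TOY_WEIGHTS: Dict[str, float] = {
    "ip24=203.0.113": 2.0,
    "hdr=server:examplehttpd/0.1": 1.5,
    "hdr=last-modified": -1.0,
}
TOY_THRESHOLD = 2.0  # assumed decision threshold


def spam_score(session: HttpSession) -> float:
    """Sum the weights of all features present in the session."""
    return sum(TOY_WEIGHTS.get(f, 0.0) for f in extract_features(session))


def predict_spam(session: HttpSession) -> bool:
    return spam_score(session) >= TOY_THRESHOLD


def retrieve(session: HttpSession,
             load_body: Callable[[], bytes]) -> Optional[bytes]:
    """Transfer the body only when the session is not predicted to be spam."""
    if predict_spam(session):
        return None  # body never downloaded, so its bandwidth is saved
    return load_body()


suspicious = HttpSession(
    "203.0.113.7",
    {"Server": "ExampleHTTPd/0.1", "Content-Type": "text/html"},
)
ordinary = HttpSession(
    "198.51.100.20",
    {"Server": "Apache", "Last-Modified": "Tue, 01 Jan 2008 00:00:00 GMT"},
)
```

In this sketch, retrieve(suspicious, ...) returns None without calling the body loader, while retrieve(ordinary, ...) calls it. A linear score is used here purely for brevity; any classifier that can run on these features within the per-request time budget would fit the same structure.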


