| Predicting web spam with HTTP session information |
| Full text |
Pdf
(299 KB)
|
Source
|
Conference on Information and Knowledge Management
archive
Proceeding of the 17th ACM conference on Information and knowledge management
table of contents
Napa Valley, California, USA
SESSION: KM: information filtering
table of contents
Pages 339-348
Year of Publication: 2008
ISBN:978-1-59593-991-3
|
|
Authors
|
|
Steve Webb
|
Georgia Institute of Technology, Atlanta, GA, USA
|
|
James Caverlee
|
Texas A&M University, College Station, TX, USA
|
|
Calton Pu
|
Georgia Institute of Technology, Atlanta, GA, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 41, Downloads (12 Months): 240, Citation Count: 0
|
|
|
ABSTRACT
Web spam is a widely-recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expose users to dangerous Web-borne malware. To defend against Web spam, most previous research analyzes the contents of Web pages and the link structure of the Web graph. Unfortunately, these heavyweight approaches require full downloads of both legitimate and spam pages to be effective, making real-time deployment of these techniques infeasible for Web browsers, high-performance Web crawlers, and real-time Web applications. In this paper, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information (i.e., hosting IP addresses and HTTP session headers). Concretely, we built an HTTP session classifier based on our predictive technique, and by incorporating this classifier into HTTP retrieval operations, we are able to detect Web spam pages before the actual content transfer. As a result, our approach protects Web users from Web-propagated malware, and it generates significant bandwidth and storage savings. By applying our predictive technique to a corpus of almost 350,000 Web spam instances and almost 400,000 legitimate instances, we were able to successfully detect 88.2% of the Web spam pages with a false positive rate of only 0.4%. These classification results are superior to previous evaluation results obtained with traditional link-based and content-based techniques. Additionally, our experiments show that our approach saves an average of 15.4 KB of bandwidth and storage resources for every successfully identified Web spam page, while only adding an average of 101 microseconds to each HTTP retrieval operation. Therefore, our predictive technique can be successfully deployed in applications that demand real-time spam detection.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
David S. Anderson , Chris Fleizach , Stefan Savage , Geoffrey M. Voelker, Spamscatter: characterizing internet scam hosting infrastructure, Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium, p.1-14, August 06-10, 2007, Boston, MA
|
| |
2
|
L. Becchetti et al. Link-based characterization and detection of web spam. In Proc. of AIRWeb '06, 2006.
|
| |
3
|
A. A. Benczur et al. Spamrank - fully automatic link spam detection. In Proc. of AIRWeb '05, 2005.
|
 |
4
|
Carlos Castillo , Debora Donato , Luca Becchetti , Paolo Boldi , Stefano Leonardi , Massimo Santini , Sebastiano Vigna, A reference collection for web spam, ACM SIGIR Forum, v.40 n.2, p.11-24, December 2006
[doi> 10.1145/1189702.1189703]
|
 |
5
|
Carlos Castillo , Debora Donato , Aristides Gionis , Vanessa Murdock , Fabrizio Silvestri, Know your neighbors: web spam detection using the web topology, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
[doi> 10.1145/1277741.1277814]
|
 |
6
|
|
| |
7
|
J. Caverlee, S. Webb, and L. Liu. Spam-resilient web rankings via influence throttling. In Proc. of IPDPS '07, 2007.
|
| |
8
|
B. D. Davison. Recognizing nepotistic links on the web. In Proc. of AIWS '00, 2000.
|
| |
9
|
I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proc. of ECML '05, 2005.
|
| |
10
|
|
 |
11
|
Dennis Fetterly , Mark Manasse , Marc Najork, Spam, damn spam, and statistics: using statistical analysis to locate spam web pages, Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, June 17-18, 2004, Paris, France
[doi> 10.1145/1017074.1017077]
|
| |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. of AIRWeb '05, 2005.
|
| |
16
|
|
| |
17
|
|
| |
18
|
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. of IJCAI '95, 1995.
|
| |
19
|
|
| |
20
|
A. Moshchuk et al. A crawler-based study of spyware in the web. In Proc. of NDSS '06, 2006.
|
| |
21
|
Alexander Moshchuk , Tanya Bragin , Damien Deville , Steven D. Gribble , Henry M. Levy, SpyProxy: execution-based detection of malicious web content, Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium, p.1-16, August 06-10, 2007, Boston, MA
|
 |
22
|
|
| |
23
|
Niels Provos , Dean McNamee , Panayiotis Mavrommatis , Ke Wang , Nagendra Modadugu, The ghost in the browser analysis of web-based malware, Proceedings of the first conference on First Workshop on Hot Topics in Understanding Botnets, p.4-4, April 10, 2007, Cambridge, MA
|
 |
24
|
|
| |
25
|
Y. M. Wang et al. Automated web patrol with strider honeymonkeys: Finding web sites that exploit browser vulnerabilities. In Proc. of NDSS '06, 2006.
|
| |
26
|
S. Webb, J. Caverlee, and C. Pu. Introducing the webb spam corpus: Using email spam to identify web spam automatically. In Proc. of CEAS '06, 2006.
|
| |
27
|
S. Webb, J. Caverlee, and C. Pu. Characterizing web spam using content and http session analysis. In Proc. of CEAS '07, 2007.
|
| |
28
|
|
 |
29
|
|
| |
30
|
|
|