Beyond blacklists: learning to detect malicious web sites from suspicious URLs
Source
International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Paris, France
SESSION: Industrial track papers table of contents
Pages 1245-1254  
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Justin Ma  UC San Diego, La Jolla, CA, USA
Lawrence K. Saul  UC San Diego, La Jolla, CA, USA
Stefan Savage  UC San Diego, La Jolla, CA, USA
Geoffrey M. Voelker  UC San Diego, La Jolla, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1557019.1557153

ABSTRACT

Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.
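
As a rough illustration of what "lexical properties" of a URL might look like as classifier input, the sketch below maps a URL to a sparse dictionary of features. The abstract does not list the paper's actual features, so the function name `lexical_features`, the feature names, and the choice of delimiters are all assumptions made for this example. Host-based properties are not covered here.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Map a URL to a sparse feature dictionary.

    Hypothetical feature set for illustration only; not the paper's.
    """
    # Add a scheme if one is missing so that urlsplit can find the hostname.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    path = parts.path + ("?" + parts.query if parts.query else "")

    # A few simple numeric properties of the URL string.
    features = {
        "url_length": float(len(url)),
        "host_length": float(len(host)),
        "host_dots": float(host.count(".")),
    }

    # Bag-of-words over hostname tokens (split on dots).
    for token in host.split("."):
        if token:
            features["host_token=" + token] = 1.0

    # Bag-of-words over path and query tokens (split on common delimiters).
    for token in re.split(r"[/?.=\-_&]", path):
        if token:
            features["path_token=" + token.lower()] = 1.0

    return features


if __name__ == "__main__":
    example = lexical_features("http://login.example.com/secure/update.php?id=1")
    for name in sorted(example):
        print(name, example[name])
```

Because every distinct token becomes its own binary feature, a corpus of many URLs can plausibly produce a vocabulary of tens of thousands of features, which is the scale the abstract mentions. A dictionary like this could then be passed to any sparse linear classifier.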


