ABSTRACT
This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms because the size of the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs is changing continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.