ABSTRACT
In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. This encoding produced more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed the other methods, with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
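The encoding and selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how relevance is measured, so information gain is assumed here, and the function names (`byte_ngrams`, `select_ngrams`, `encode`) and the default values of `n` and `k` are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct n-byte sequences that occur in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a group with pos and neg examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def select_ngrams(samples, n: int = 4, k: int = 500) -> list:
    """Rank n-grams by information gain and keep the top k.

    samples is a list of (bytes, label) pairs, label 1 = malicious,
    0 = benign. Information gain is an ASSUMED relevance measure.
    """
    total = len(samples)
    total_pos = sum(label for _, label in samples)
    base = entropy(total_pos, total - total_pos)

    seen = Counter()      # number of executables containing each n-gram
    seen_pos = Counter()  # number of malicious executables containing it
    for data, label in samples:
        for gram in byte_ngrams(data, n):
            seen[gram] += 1
            if label:
                seen_pos[gram] += 1

    def gain(gram) -> float:
        present, present_pos = seen[gram], seen_pos[gram]
        absent = total - present
        absent_pos = total_pos - present_pos
        conditional = (
            (present / total) * entropy(present_pos, present - present_pos)
            + (absent / total) * entropy(absent_pos, absent - absent_pos)
        )
        return base - conditional

    # Ties are broken by byte order so the ranking is deterministic.
    return sorted(seen, key=lambda g: (-gain(g), g))[:k]


def encode(data: bytes, selected: list, n: int = 4) -> list:
    """Encode one executable as Boolean presence/absence features."""
    grams = byte_ngrams(data, n)
    return [1 if g in grams else 0 for g in selected]
```

Calling `select_ngrams` on the labeled training executables and then `encode` on each one would yield fixed-length Boolean vectors that could be passed to any of the learners named above. Counting every distinct n-gram in memory, as this sketch does, would likely not be practical at the scale of hundreds of millions of n-grams.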