Learning to detect malicious executables in the wild
Source International Conference on Knowledge Discovery and Data Mining
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Seattle, WA, USA
SESSION: Industry/government track papers
Pages: 470-478
Year of Publication: 2004
ISBN:1-58113-888-1
Authors
Jeremy Z. Kolter  Georgetown University, Washington, DC
Marcus A. Maloof  Georgetown University, Washington, DC
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 14,   Downloads (12 Months): 120,   Citation Count: 15
DOI: http://doi.acm.org/10.1145/1014052.1014105

ABSTRACT

In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. This processing produced more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed the other methods, with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
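
The pipeline summarized in the abstract (byte n-gram features, selection of the most relevant n-grams, evaluation by area under the ROC curve) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the value of n or how relevance was measured, so the default n = 4 and the use of information gain as the relevance score are assumptions made here for illustration, and every function name below is hypothetical.

```python
import math


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the distinct overlapping byte n-grams in `data`.
    The default n = 4 is an assumption, not a value from the abstract."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a class split; 0.0 for an empty or pure split."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(examples, labels, gram) -> float:
    """Information gain of the Boolean feature "gram is present".
    `examples` is a list of n-gram sets; `labels` holds True for malicious.
    Information gain is an assumed relevance measure."""
    with_pos = with_neg = without_pos = without_neg = 0
    for grams, malicious in zip(examples, labels):
        if gram in grams:
            with_pos += malicious
            with_neg += not malicious
        else:
            without_pos += malicious
            without_neg += not malicious
    total = len(labels)
    base = entropy(with_pos + without_pos, with_neg + without_neg)
    conditional = (
        (with_pos + with_neg) / total * entropy(with_pos, with_neg)
        + (without_pos + without_neg) / total * entropy(without_pos, without_neg)
    )
    return base - conditional


def select_top_ngrams(examples, labels, k: int) -> list:
    """Return the k n-grams with the highest information gain (ties broken by byte order)."""
    vocabulary = set().union(*examples)
    ranked = sorted(
        vocabulary,
        key=lambda g: (-information_gain(examples, labels, g), g),
    )
    return ranked[:k]


def auc(scores, labels) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen malicious example scores higher than a randomly chosen benign one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In use, each executable would be read as bytes and passed through `byte_ngrams`, the selected n-grams would become Boolean features for a classifier of choice, and `auc` would be applied to that classifier's scores on held-out examples. This exhaustive scoring is only practical for small vocabularies; a collection with hundreds of millions of distinct n-grams would need a streaming or disk-backed approach.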


