Detecting code clones in binary executables
Source
International Symposium on Software Testing and Analysis
Proceedings of the eighteenth international symposium on Software testing and analysis
Chicago, IL, USA
SESSION: Testing and analysis tools #1
Pages 117-128
Year of Publication: 2009
ISBN: 978-1-60558-338-9
Authors
Andreas Sæbjørnsen  University of California, Davis, Davis, CA, USA
Jeremiah Willcock  Indiana University, Bloomington, IN, USA
Thomas Panas  Lawrence Livermore National Laboratory, Livermore, CA, USA
Daniel Quinlan  Lawrence Livermore National Laboratory, Livermore, CA, USA
Zhendong Su  University of California, Davis, Davis, CA, USA
Sponsors
SIGSOFT: ACM Special Interest Group on Software Engineering
SIGPLAN: ACM Special Interest Group on Programming Languages
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1572272.1572287

ABSTRACT

Large software projects contain significant code duplication, mainly due to copying and pasting code. Many techniques have been developed to identify duplicated code to enable applications such as refactoring, detecting bugs, and protecting intellectual property. Because source code is often unavailable, especially for third-party software, finding duplicated code in binaries becomes particularly important. However, existing techniques operate primarily on source code, and no effective tool exists for binaries.

In this paper, we describe the first practical clone detection algorithm for binary executables. Our algorithm extends an existing tree similarity framework based on clustering of characteristic vectors of labeled trees with novel techniques to normalize assembly instructions and to accurately and compactly model their structural information. We have implemented our technique and evaluated it on Windows XP system binaries totaling over 50 million assembly instructions. Results show that it is both scalable and precise: it analyzed Windows XP system binaries in a few hours and produced few false positives. We believe our technique is a practical, enabling technology for many applications dealing with binary code.
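The idea of normalizing assembly instructions and comparing characteristic vectors can be illustrated with a minimal sketch. This is not the authors' implementation: the operand classes (`REG`, `MEM`, `VAL`), the function names, and the sample instruction sequences are all assumptions made for illustration. The sketch also simplifies in two ways: it uses flat feature counts rather than vectors derived from labeled trees, and it compares two regions by plain Euclidean distance rather than clustering many vectors.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical register set for illustration; a real tool would cover the full ISA.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def normalize_operand(op):
    """Map a concrete operand to an abstract class (assumed classes: MEM, REG, VAL)."""
    op = op.strip().lower()
    if "[" in op:
        return "MEM"
    if op in REGISTERS:
        return "REG"
    if re.fullmatch(r"-?(0x[0-9a-f]+|\d+)", op):
        return "VAL"
    return "OTHER"


def normalize_instruction(line):
    """Turn 'mov eax, [ebp+8]' into ('mov', 'REG', 'MEM')."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [normalize_operand(o) for o in rest.split(",") if o.strip()]
    return (mnemonic.lower(), *operands)


def characteristic_vector(instructions):
    """Count mnemonics and operand classes over a code region (flat counts only)."""
    counts = Counter()
    for line in instructions:
        norm = normalize_instruction(line)
        counts[("op", norm[0])] += 1
        for kind in norm[1:]:
            counts[("operand", kind)] += 1
    return counts


def distance(u, v):
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


# Illustrative regions: a and b differ only in register names, offsets, and constants.
a = ["mov eax, [ebp+8]", "add eax, 4", "ret"]
b = ["mov ecx, [ebp+12]", "add ecx, 16", "ret"]
c = ["push ebp", "mov ebp, esp", "xor eax, eax"]

va, vb, vc = (characteristic_vector(r) for r in (a, b, c))
print(distance(va, vb))  # regions a and b normalize to the same vector
print(distance(va, vc))  # region c has a different instruction mix
```

In this sketch, regions whose instructions differ only in concrete registers, memory offsets, or constants map to identical vectors, while regions with a different instruction mix end up farther apart. Comparing every pair directly would not scale to tens of millions of instructions, which is why a clustering step over the vectors matters in practice.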


