Efficient token based clone detection with flexible tokenization
Source: Foundations of Software Engineering
The 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: companion papers
Dubrovnik, Croatia
POSTER SESSION: ESEC/FSE'07 posters
Pages: 513 - 516  
Year of Publication: 2007
ISBN: 978-1-59593-812-1
Authors
Hamid Abdul Basit  Lahore University of Management Sciences, Lahore, Pakistan
Simon J. Puglisi  Curtin University of Technology
William F. Smyth  McMaster University
Andrew Turpin  RMIT University
Stan Jarzabek  National University of Singapore, Singapore
Sponsors
ACM: Association for Computing Machinery
SIGSOFT: ACM Special Interest Group on Software Engineering
CEPIS : The Council of European Professional Informatics Societies
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1295014.1295029

ABSTRACT

Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding, and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for clone analysis, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but it is computationally expensive because of the large number of tokens that must be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection and implemented them in our tool, Repeated Tokens Finder (RTF). Instead of using a suffix tree for string matching, we use the more memory-efficient suffix array. RTF incorporates a suffix-array-based linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple and scalable, and that it performs better than previous well-known tools.
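
The general idea of token-based clone detection over a suffix array can be illustrated with a small sketch. This is not RTF's implementation: the function names, the regular-expression tokenizer, and the identifier-abstraction flag are all assumptions made for illustration, and the suffix array is built by naive sorting rather than by the linear-time algorithm the abstract mentions. It also compares only suffixes that are adjacent in the suffix array, so a fragment with three or more occurrences is not reported for every pair.

```python
import re
from typing import List, Tuple

# Hypothetical tokenizer: identifiers/keywords, integer literals, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str, abstract_identifiers: bool = False) -> List[str]:
    """Split source text into tokens; optionally map every identifier to 'ID'."""
    tokens = TOKEN_RE.findall(source)
    if abstract_identifiers:
        tokens = ["ID" if re.match(r"[A-Za-z_]", t) else t for t in tokens]
    return tokens


def suffix_array(tokens: List[str]) -> List[int]:
    """Naive construction by sorting suffixes (simple, but not linear time)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def lcp_array(tokens: List[str], sa: List[int]) -> List[int]:
    """lcp[k] = number of leading tokens shared by suffixes sa[k-1] and sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while (a + n < len(tokens) and b + n < len(tokens)
               and tokens[a + n] == tokens[b + n]):
            n += 1
        lcp[k] = n
    return lcp


def repeated_runs(tokens: List[str], min_len: int) -> List[Tuple[int, int, int]]:
    """Return (start_a, start_b, length) for repeated token runs of at least
    min_len tokens, keeping only runs that cannot be extended to the left."""
    sa = suffix_array(tokens)
    lcp = lcp_array(tokens, sa)
    runs = []
    for k in range(1, len(sa)):
        if lcp[k] < min_len:
            continue
        i, j = sorted((sa[k - 1], sa[k]))
        # Skip a pair if the preceding tokens also match: a longer run covers it.
        if i > 0 and tokens[i - 1] == tokens[j - 1]:
            continue
        runs.append((i, j, lcp[k]))
    return sorted(runs)


if __name__ == "__main__":
    code = "foo = bar + 1; baz = qux + 1;"
    print(repeated_runs(tokenize(code), 4))
    print(repeated_runs(tokenize(code, abstract_identifiers=True), 4))
```

In this sketch the two statements share no run of four or more raw tokens, but once identifiers are abstracted to a single token class they match as one six-token run. That is one way a customizable tokenization step can change which fragments count as clones.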


