What's the code?: automatic classification of source code archives
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
POSTER SESSION: Poster papers
Pages: 632-638
Year of Publication: 2002
ISBN:1-58113-567-X
Authors
Secil Ugurel  The Pennsylvania State University, University Park, PA
Robert Krovetz  NEC Research Institute, Princeton, NJ
C. Lee Giles  The Pennsylvania State University, University Park, PA
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
AAAI
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775047.775141

ABSTRACT

There are various source code archives on the World Wide Web. These archives are usually organized by application categories and programming languages. However, manually organizing source code repositories is not a trivial task since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the SourceForge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and identified as belonging to a specific programming language.
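To make the idea of training an SVM on examples of a given programming language concrete, here is a minimal sketch in standard-library Python. It is not the authors' implementation. The token-count features, the one-vs-rest arrangement, the hyperparameters, the toy training snippets, and every function name are assumptions chosen for illustration; the abstract does not specify any of them.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split source text into identifiers/keywords, numbers and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9]", source)


def featurize(source, vocab):
    """Map source text to an L2-normalised vector of token counts over vocab."""
    counts = Counter(tokenize(source))
    vec = [float(counts[tok]) for tok in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def train_linear_svm(X, y, epochs=200, lam=0.01, lr=0.1):
    """Binary linear SVM trained by stochastic subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]  # step on the L2 penalty
            if margin < 1.0:  # hinge loss is active for this example
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def score(model, x):
    """Signed decision value of one binary model for feature vector x."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_language_classifier(samples):
    """samples is a list of (source_text, language) pairs.
    Trains one binary SVM per language (one-vs-rest, an assumption)."""
    vocab = sorted({tok for text, _ in samples for tok in tokenize(text)})
    X = [featurize(text, vocab) for text, _ in samples]
    models = {}
    for lang in sorted({label for _, label in samples}):
        y = [1 if label == lang else -1 for _, label in samples]
        models[lang] = train_linear_svm(X, y)
    return vocab, models


def predict_language(source, vocab, models):
    """Return the language whose binary SVM gives the highest decision value."""
    x = featurize(source, vocab)
    return max(models, key=lambda lang: score(models[lang], x))


# Toy training data, invented for this sketch.
SAMPLES = [
    ('#include <stdio.h>\nint main(void) { printf("hi"); return 0; }', "C"),
    ("#include <stdlib.h>\nvoid *p = malloc(10); free(p);", "C"),
    ("import sys\ndef main():\n    print(sys.argv)", "Python"),
    ("def add(a, b):\n    return a + b", "Python"),
    ('public class Main { public static void main(String[] args) { System.out.println("hi"); } }', "Java"),
    ("import java.util.List;\npublic class Box { private int size; }", "Java"),
]

VOCAB, MODELS = train_language_classifier(SAMPLES)
print(predict_language("#include <math.h>\nint sq(int x) { return x * x; }", VOCAB, MODELS))
```

Six snippets are far too few to say anything about accuracy; the sketch only shows the shape of the pipeline (tokenize, build feature vectors, train one binary classifier per class, pick the highest-scoring class). A realistic experiment would use a large labelled corpus and an established SVM library.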


