What's the code?: automatic classification of source code archives
Source
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
POSTER SESSION: Poster papers
Pages: 632 - 638
Year of Publication: 2002
ISBN: 1-58113-567-X
Bibliometrics
Downloads (6 Weeks): 6, Downloads (12 Months): 38, Citation Count: 9
ABSTRACT
There are various source code archives on the World Wide Web. These archives are usually organized by application category and programming language. However, manually organizing source code repositories is not a trivial task, since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and identified as belonging to a specific programming language.
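As a rough illustration of the idea in the abstract, the sketch below trains a two-class linear SVM to tell C from Python. Everything in it is an assumption made for illustration and not the paper's method: the features are plain counts of identifier and keyword tokens, the trainer is a simple subgradient descent on the hinge loss with an L2 penalty, and the four training snippets and all function names (tokenize, featurize, train_linear_svm, predict) are hypothetical.

```python
import re
from collections import Counter


def tokenize(code):
    """Split source text into identifier/keyword tokens (punctuation and numbers are dropped)."""
    return re.findall(r"[A-Za-z_]\w*", code)


def featurize(code):
    """Bag-of-tokens feature vector as a sparse mapping: token -> count."""
    return Counter(tokenize(code))


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(tok, 0.0) * count for tok, count in features.items())


def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Fit a linear SVM (hinge loss + L2 penalty) by subgradient descent.

    examples: list of (features, label) pairs with label in {+1, -1}.
    The hyperparameters are arbitrary illustrative values.
    """
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            margin = label * dot(weights, features)
            # The L2 penalty shrinks every weight slightly on each step.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # The hinge loss contributes only when the example is inside the margin.
            if margin < 1.0:
                for tok, count in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * count
    return weights


def predict(weights, code):
    """Label a snippet by the sign of its decision value (+ means C, otherwise Python)."""
    return "C" if dot(weights, featurize(code)) > 0 else "Python"


# Hypothetical toy training set: +1 = C, -1 = Python.
TRAINING = [
    ("int main(void) { return 0; }", +1),
    ("#include <stdio.h>\nint add(int a, int b) { return a + b; }", +1),
    ("def greet(name):\n    print(name)", -1),
    ("import sys\nfor arg in sys.argv:\n    print(arg)", -1),
]

weights = train_linear_svm([(featurize(src), label) for src, label in TRAINING])

print(predict(weights, "int square(int n) { return n * n; }"))  # C
print(predict(weights, "def area(r):\n    print(r)"))           # Python
```

A binary classifier like this only separates two classes. Covering the ten languages and eleven topics mentioned in the abstract would need one such classifier per class or some other multiclass scheme, and a realistic system would need far more training data and a more careful choice of features.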
CITED BY 9
Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre Baldi, Mining concepts from code with probabilistic topic models, Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, November 05-09, 2007, Atlanta, Georgia, USA
Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul Rigor, Cristina Lopes, Pierre Baldi, Sourcerer: mining and searching internet-scale software repositories, Data Mining and Knowledge Discovery, v.18 n.2, p.300-336, April 2009