Information theoretic evaluation of change prediction models for large-scale software
Source: International Conference on Software Engineering
Proceedings of the 2006 International Workshop on Mining Software Repositories
Shanghai, China
Session: Defects
Pages: 126-132
Year of Publication: 2006
ISBN: 1-59593-397-2
Authors
Mina Askari, University of Waterloo, Waterloo, Canada
Ric Holt, University of Waterloo, Waterloo, Canada
Sponsors
ACM: Association for Computing Machinery
SIGSOFT: ACM Special Interest Group on Software Engineering
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3, Downloads (12 Months): 48, Citation Count: 3
DOI: http://doi.acm.org/10.1145/1137983.1138013

ABSTRACT

In this paper, we analyze the data extracted from several open source software repositories. We observe that the change data follows a Zipf distribution. Based on the extracted data, we then develop three probabilistic models to predict which files will have changes or bugs. The first model is Maximum Likelihood Estimation (MLE), which simply counts the number of events, i.e., changes or bugs, that happen to each file and normalizes the counts to compute a probability distribution. The second model is Reflexive Exponential Decay (RED) in which we postulate that the predictive rate of modification in a file is incremented by any modification to that file and decays exponentially. The third model is called RED-Co-Change. With each modification to a given file, the RED-Co-Change model not only increments its predictive rate, but also increments the rate for other files that are related to the given file through previous co-changes. We then present an information-theoretic approach to evaluate the performance of different prediction models. In this approach, the closeness of model distribution to the actual unknown probability distribution of the system is measured using cross entropy. We evaluate our prediction models empirically using the proposed information-theoretic approach for six large open source systems. Based on this evaluation, we observe that of our three prediction models, the RED-Co-Change model predicts the distribution that is closest to the actual distribution for all the studied systems.
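
The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the half-life parameterization of the decay, the additive smoothing term, and the toy data are all assumptions made for illustration. The sketch covers an MLE-style model (normalized event counts), a RED-style model (each past event contributes a weight that decays exponentially with its age), and a cross-entropy score computed against later observed events. The co-change propagation of RED-Co-Change is left out.

```python
import math
from collections import Counter


def mle_distribution(events, files, smoothing=1e-6):
    """Normalize per-file event counts into a probability distribution.

    `smoothing` is an assumption: it keeps unseen files at a non-zero
    probability so the cross entropy stays finite.
    """
    counts = Counter(events)
    total = sum(counts[f] + smoothing for f in files)
    return {f: (counts[f] + smoothing) / total for f in files}


def red_distribution(timed_events, files, now, half_life=30.0, smoothing=1e-6):
    """Exponentially decayed event weights, normalized to a distribution.

    Each (time, file) event adds exp(-decay * age) to the file's rate.
    `half_life` is a hypothetical parameter, in the same units as time.
    """
    decay = math.log(2) / half_life
    rates = {f: smoothing for f in files}
    for t, f in timed_events:
        rates[f] += math.exp(-decay * (now - t))
    total = sum(rates.values())
    return {f: r / total for f, r in rates.items()}


def cross_entropy(model, observed_events):
    """Average negative log2 probability of the observed events, in bits.

    Lower values mean the model distribution is closer to the
    distribution that generated the observed events.
    """
    return -sum(math.log2(model[f]) for f in observed_events) / len(observed_events)


# Toy data (invented): a.c changed long ago, b.c changed recently.
files = ["a.c", "b.c", "c.c"]
history = [(0, "a.c"), (1, "a.c"), (2, "a.c"), (90, "b.c"), (95, "b.c")]
future = ["b.c", "b.c", "b.c", "a.c"]  # events observed after time 100

mle = mle_distribution([f for _, f in history], files)
red = red_distribution(history, files, now=100)

print("MLE cross entropy (bits):", cross_entropy(mle, future))
print("RED cross entropy (bits):", cross_entropy(red, future))
```

On this toy data the decayed model gives most of its probability to the recently changed file, so it scores a lower cross entropy than the count-based model when later changes concentrate on that file. This shows only how the evaluation works and says nothing about the results reported for the six studied systems.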

