Nearest neighbor sampling for better defect prediction
Source: ACM SIGSOFT Software Engineering Notes
Volume 30, Issue 4 (July 2005)
Session: Predictor Models in Software Engineering (PROMISE)
Pages: 1-6
Year of Publication: 2005
ISSN:0163-5948
Author
Gary D. Boetticher, University of Houston - Clear Lake, Houston, Texas
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1082983.1083173

ABSTRACT

An important step in building effective predictive models is applying one or more sampling techniques. Traditional sampling techniques include random, stratified, systematic, and clustered sampling. The problem with these techniques is that they focus on the class attribute rather than the non-class attributes. For example, if a test instance's nearest neighbor in the training set belongs to the opposite class, then the instance seems doomed to misclassification. To illustrate this problem, this paper conducts 20 experiments on five NASA defect datasets (CM1, JM1, KC1, KC2, PC1) using two learners (J48 and Naïve Bayes). Each dataset is divided into three groups: a training set, a "nice neighbor" test set, and a "nasty neighbor" test set. Using a nearest neighbor approach, "nice neighbors" are those test instances closest to training instances of the same class, while "nasty neighbors" are those closest to training instances of the opposite class. The "nice" experiments average 94 percent accuracy and the "nasty" experiments average 20 percent accuracy. Based on these results, a new nearest neighbor sampling technique is proposed.
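
The nice/nasty split described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not name a distance metric, so Euclidean distance is assumed here, and the function names (`nearest_neighbor_label`, `split_nice_nasty`) and the toy two-attribute data are hypothetical, chosen only for illustration.

```python
import math


def nearest_neighbor_label(features, train):
    """Return the class label of the training instance closest to `features`.

    Assumption: Euclidean distance over numeric, non-class attributes.
    """
    nearest = min(train, key=lambda t: math.dist(features, t[0]))
    return nearest[1]


def split_nice_nasty(train, test):
    """Partition test instances by the class of their nearest training neighbor.

    "Nice" instances share a class with their nearest training neighbor;
    "nasty" instances have a nearest training neighbor of the opposite class.
    Each instance is a (features, label) pair.
    """
    nice, nasty = [], []
    for features, label in test:
        if nearest_neighbor_label(features, train) == label:
            nice.append((features, label))
        else:
            nasty.append((features, label))
    return nice, nasty


# Toy data (made up for illustration, not taken from the NASA datasets).
train = [((0.0, 0.0), "clean"), ((10.0, 10.0), "defective")]
test = [
    ((1.0, 1.0), "clean"),      # nearest training neighbor is "clean"
    ((9.0, 9.0), "clean"),      # nearest training neighbor is "defective"
    ((8.0, 9.0), "defective"),  # nearest training neighbor is "defective"
]

nice, nasty = split_nice_nasty(train, test)
print("nice: ", nice)
print("nasty:", nasty)
```

In this toy case the second test instance lands in the nasty set because its closest training instance carries the opposite label, which is the situation the abstract describes as seeming doomed to misclassification.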

