Can data transformation help in the detection of fault-prone modules?
Source: International Symposium on Software Testing and Analysis
Proceedings of the 2008 workshop on Defects in large software systems
Seattle, Washington
Session: Technical papers
Pages 16-20  
Year of Publication: 2008
ISBN:978-1-60558-051-7
Authors
Yue Jiang  West Virginia University, Morgantown, WV
Bojan Cukic  West Virginia University, Morgantown, WV
Tim Menzies  West Virginia University, Morgantown, WV
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1390817.1390822

ABSTRACT

Data preprocessing (transformation) plays an important role in data mining and machine learning. In this study, we investigate the effect of four different preprocessing methods on fault-proneness prediction, using nine datasets from the NASA Metrics Data Program (MDP) and ten classification algorithms. Our experiments indicate that log transformation rarely improves classification performance, but discretization affects the performance of many different algorithms. The impact varies with the transformation and the classifier. The random forest algorithm, for example, performs better with the original and log-transformed data sets. Boosting and NaiveBayes perform significantly better with discretized data. We conclude that no general benefit can be expected from data transformations. Instead, selected transformation techniques are recommended to boost the performance of specific classification algorithms.
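To make the two transformations named in the abstract concrete, the following is a minimal sketch in Python, not the authors' procedure. The abstract does not say how zero-valued metrics are handled or which discretizer was used, so the `floor` parameter, the equal-width binning scheme, the function names, and the sample values are all assumptions chosen for illustration.

```python
import math


def log_transform(values, floor=1e-6):
    """Natural-log transform of a metric column.

    Values below `floor` are raised to `floor` first so that zero-valued
    metrics do not trigger log(0). The floor value is an assumption.
    """
    return [math.log(max(v, floor)) for v in values]


def equal_width_bins(values, n_bins=3):
    """Replace each value with a bin index in 0..n_bins-1.

    Equal-width binning is a simple unsupervised stand-in; the abstract
    does not state which discretization method the study applied.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up lines-of-code values for five hypothetical modules.
loc = [0, 5, 12, 40, 300]
print(log_transform(loc))
print(equal_width_bins(loc))
```

Software metrics are often heavily right-skewed, as in the made-up `loc` column, where equal-width binning places four of the five modules in the same bin. This is one reason the choice of discretizer can matter for a classifier.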


