ABSTRACT
Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criteria to use. A further complication is that learning methods that perform well on one criterion may not perform well on other criteria. For example, SVMs and boosting are designed to optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low-dimensional manifold. The three metrics that are appropriate when predictions are interpreted as probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far away from the metrics that depend on the relative order of the predicted values (ROC area, average precision, break-even point, and lift). In between them fall two metrics that depend on comparing predictions to a threshold: accuracy and F-score. As expected, maximum margin methods such as SVMs and boosted trees have excellent performance on metrics like accuracy, but perform poorly on probability metrics such as squared error. What was not expected was that the margin methods have excellent performance on ordering metrics such as ROC area and average precision. We introduce a new metric, SAR, that combines squared error, accuracy, and ROC area into one metric. MDS and correlation analysis show that SAR is centrally located and correlates well with other metrics, suggesting that it is a good general-purpose metric to use when more specific criteria are not known.
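To make the three metric families concrete, here is a minimal Python sketch that computes one threshold metric (accuracy), one ordering metric (ROC area), and one probability metric (root-mean-squared error), then combines them. The abstract does not give the SAR formula; the equal-weight form (accuracy + ROC area + (1 - RMSE)) / 3 used here is an assumption, as are the function names, the 0.5 threshold, and the toy data.

```python
from math import sqrt


def accuracy(y_true, y_prob, threshold=0.5):
    """Threshold metric: fraction of cases where (p >= threshold) matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def rmse(y_true, y_prob):
    """Probability metric: root-mean-squared error between predictions and 0/1 labels."""
    return sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true))


def roc_area(y_true, y_prob):
    """Ordering metric: probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is fine for a sketch.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sar(y_true, y_prob, threshold=0.5):
    """Assumed combination: equal-weight mean of accuracy, ROC area, and 1 - RMSE."""
    return (
        accuracy(y_true, y_prob, threshold)
        + roc_area(y_true, y_prob)
        + (1.0 - rmse(y_true, y_prob))
    ) / 3.0


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    # Scores that rank and threshold perfectly but sit close to 0.5,
    # loosely mimicking an uncalibrated margin-based model.
    margin_like = [0.45, 0.49, 0.51, 0.55]
    print("accuracy:", accuracy(labels, margin_like))
    print("ROC area:", roc_area(labels, margin_like))
    print("RMSE:    ", rmse(labels, margin_like))
    print("SAR:     ", sar(labels, margin_like))
```

The toy scores illustrate the pattern the abstract describes: predictions can be perfect on threshold and ordering metrics while still scoring poorly on squared error, and a combined score penalizes that gap.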
INDEX TERMS
Primary Classification: I. Computing Methodologies; I.5 PATTERN RECOGNITION; I.5.2 Design Methodology
Subjects: Classifier design and evaluation
General Terms: Algorithms, Experimentation, Measurement, Performance
Keywords: ROC, cross entropy, lift, metrics, performance evaluation, precision, recall, supervised learning