| Magical thinking in data mining: lessons from CoIL challenge 2000 |
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
San Francisco, California
Pages: 426 - 431
Year of Publication: 2001
ISBN: 1-58113-391-X

Author
Charles Elkan
University of California, San Diego, La Jolla, California
| Bibliometrics |
Downloads (6 Weeks): 5, Downloads (12 Months): 45, Citation Count: 8
ABSTRACT
CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not appreciate properly the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.
CITED BY 8
Eamonn Keogh , Stefano Lonardi , Chotirat Ann Ratanamahatana , Li Wei , Sang-Hee Lee , John Handley, Compression-based data mining of sequential data, Data Mining and Knowledge Discovery, v.14 n.1, p.99-129, February 2007