Magical thinking in data mining: lessons from CoIL challenge 2000
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
San Francisco, California
Pages: 426 - 431  
Year of Publication: 2001
ISBN:1-58113-391-X
Author
Charles Elkan, University of California, San Diego, La Jolla, California
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
AAAI : American Association for Artificial Intelligence
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/502512.502576

ABSTRACT

CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not appreciate properly the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.
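
The abstract's warning about model search can be illustrated with a small simulation. The sketch below is not taken from the paper: the function name `simulate_model_search`, its parameters, and the dataset sizes are all hypothetical, chosen only for illustration. It scores many classifiers that guess at random on a small validation set, keeps the best score, and compares it with accuracy on fresh data.

```python
import random


def simulate_model_search(n_models=200, n_val=100, n_fresh=10_000, seed=0):
    """Pick the best of many skill-free classifiers on a small validation set.

    All names and sizes here are illustrative assumptions, not values
    from the CoIL contest or the paper.
    """
    rng = random.Random(seed)
    # Random binary labels; every "model" guesses at random,
    # so its expected accuracy is 0.5.
    val_labels = [rng.randint(0, 1) for _ in range(n_val)]

    best_val_acc = -1.0
    for _ in range(n_models):
        preds = [rng.randint(0, 1) for _ in range(n_val)]
        acc = sum(p == y for p, y in zip(preds, val_labels)) / n_val
        best_val_acc = max(best_val_acc, acc)

    # The selected model is still a random guesser, so its behaviour on
    # fresh data is simulated the same way, with new labels and guesses.
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]
    fresh_preds = [rng.randint(0, 1) for _ in range(n_fresh)]
    fresh_acc = sum(p == y for p, y in zip(fresh_preds, fresh_labels)) / n_fresh
    return best_val_acc, fresh_acc


if __name__ == "__main__":
    best_val, fresh = simulate_model_search()
    print(f"best validation accuracy: {best_val:.3f}")
    print(f"accuracy on fresh data:   {fresh:.3f}")
```

Under these assumptions the best validation score sits well above 0.5 even though no candidate has any real skill, while accuracy on fresh data stays close to 0.5. The gap comes purely from selecting the maximum over many noisy estimates.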


