ABSTRACT
The overall performance of a data mining process depends not only on the value of the induced knowledge but also on the costs of the process itself, such as the cost of acquiring and pre-processing training examples, the CPU cost of model induction, and the cost of committed errors. Recently, several progressive sampling strategies for maximizing overall data mining utility have been proposed. All of these strategies repeatedly acquire additional training examples until a decrease in utility is observed. In this paper, we present an alternative, projective sampling strategy, which fits functions to a partial learning curve and a partial run-time curve obtained from a small subset of the potentially available data and then uses these projected functions to analytically estimate the optimal training set size. The proposed approach is evaluated on a variety of benchmark datasets using the RapidMiner environment for machine learning and data mining processes. The results show that learning and run-time curves projected from only a few data points can lead to a cheaper data mining process than the common progressive sampling methods.
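The projective idea can be illustrated with a minimal sketch. The abstract does not state which functional forms or cost model the paper uses, so everything below is an assumption made for illustration: a power-law learning curve error(n) = a * n^b with b < 0, a linear run-time curve t(n) = c + d * n, and a total cost that adds up example acquisition, CPU time, and errors on a fixed number of future predictions. All function names, data points, and cost values are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_power_law(sizes, errors):
    """Fit error(n) = a * n**b by linear regression in log-log space.
    The power-law form is an assumption, not taken from the abstract."""
    log_a, b = fit_line([math.log(n) for n in sizes],
                        [math.log(e) for e in errors])
    return math.exp(log_a), b

def total_cost(n, a, b, c, d, cost_example, cost_cpu, cost_error, n_scored):
    """Assumed cost model: acquisition + CPU time + cost of future errors."""
    return (cost_example * n
            + cost_cpu * (c + d * n)
            + cost_error * n_scored * a * n ** b)

def optimal_size(a, b, d, cost_example, cost_cpu, cost_error, n_scored):
    """Closed-form minimiser of total_cost, obtained by setting its
    derivative with respect to n to zero (valid only for b < 0)."""
    marginal = cost_example + cost_cpu * d  # cost of one more example
    return (cost_error * n_scored * a * (-b) / marginal) ** (1.0 / (1.0 - b))

# Hypothetical partial curves measured on four small training set sizes.
# The values are synthetic, generated from the assumed forms themselves.
sizes = [100, 200, 400, 800]
errors = [0.5 * n ** -0.5 for n in sizes]
times = [0.1 + 0.001 * n for n in sizes]

a, b = fit_power_law(sizes, errors)   # projected learning curve
c, d = fit_line(sizes, times)         # projected run-time curve

costs = dict(cost_example=0.01, cost_cpu=1.0, cost_error=1.0, n_scored=10000)
n_star = optimal_size(a, b, d, **costs)
print(f"projected optimal training set size: {n_star:.0f}")
```

In contrast to a progressive schedule, which keeps acquiring data until the observed utility drops, this sketch measures only a few small samples and then solves for the size at which the projected marginal cost of one more example equals the projected marginal saving in error cost.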