Reverse testing: an efficient framework to select amongst classifiers under sample selection bias
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Philadelphia, PA, USA
SESSION: Research track papers
Pages: 147 - 156  
Year of Publication: 2006
ISBN:1-59593-339-5
Authors
Wei Fan  IBM T. J. Watson Research, Hawthorne, NY
Ian Davidson  University of Albany, State University of New York, Albany, NY
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1150402.1150422

ABSTRACT

One of the most important assumptions made by many classification algorithms is that the training and test sets are drawn from the same distribution, i.e., the so-called "stationary distribution assumption" that the future and the past data sets are identical from a probabilistic standpoint. In many real-world application domains, such as marketing solicitation, fraud detection, drug testing, loan approval, sub-population surveys, and school enrollment, among others, this is rarely the case, because the only labeled sample available for training is biased in different ways due to a variety of practical reasons and limitations. In these circumstances, traditional methods for evaluating the expected generalization error of classification algorithms, such as structural risk minimization, ten-fold cross-validation, and leave-one-out validation, usually return poor estimates of which classification algorithm, among a number of competing candidates, will be the most accurate on a future unbiased dataset when trained on a biased dataset. Sometimes the estimated order of the learning algorithms' accuracy is so poor that it is no better than random guessing. Therefore, a method to determine the most accurate learner is needed for data mining under sample selection bias in many real-world applications. We present such an approach, which can determine which learner will perform best on an unbiased test set, given a possibly biased training set, at a fraction of the computational cost of cross-validation-based approaches.
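
The problem setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method; it only shows, under invented assumptions, how an accuracy estimate computed on held-out data from a biased labeled sample can differ from accuracy on the unbiased population. The population, the 10% label noise, the selection mechanism, and the majority-class classifier are all hypothetical choices made for illustration.

```python
import math
import random

rng = random.Random(0)

# Hypothetical population: one feature x, label y = 1 when x > 0,
# with 10% of the labels flipped at random.
n = 20000
population = []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    y = 1 if x > 0 else 0
    if rng.random() < 0.1:
        y = 1 - y
    population.append((x, y))

# Assumed selection mechanism: an example ends up in the labeled sample
# with a probability that grows with x, so the sample is biased on x.
biased = [
    (x, y)
    for x, y in population
    if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * x))
]

# Split the biased sample into a training half and a held-out half.
half = len(biased) // 2
train, holdout = biased[:half], biased[half:]

# A deliberately simple classifier: always predict the training majority class.
majority = 1 if 2 * sum(y for _, y in train) >= len(train) else 0


def accuracy(examples, predicted_label):
    """Fraction of examples whose label equals the constant prediction."""
    return sum(1 for _, y in examples if y == predicted_label) / len(examples)


holdout_estimate = accuracy(holdout, majority)      # estimate from biased data
unbiased_accuracy = accuracy(population, majority)  # accuracy on the population

print(f"held-out estimate on biased sample: {holdout_estimate:.3f}")
print(f"accuracy on unbiased population:    {unbiased_accuracy:.3f}")
```

Under these assumptions the held-out estimate should come out noticeably higher than the accuracy on the full population (roughly 0.7 versus 0.5), because the held-out data shares the training sample's bias. Cross-validation on the biased sample inherits the same problem, which is the gap the abstract refers to.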

