Making generative classifiers robust to selection bias
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
San Jose, California, USA
SESSION: Research track papers
Pages: 657 - 666  
Year of Publication: 2007
ISBN:978-1-59593-609-7
Authors
Andrew T. Smith  University of California, San Diego
Charles Elkan  University of California, San Diego
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281192.1281263

ABSTRACT

This paper presents approaches to semi-supervised learning when the labeled training data and the test data are differently distributed. Specifically, the samples selected for labeling are a biased subset of some general distribution, and the test set consists of samples drawn from either that general distribution or the distribution of the unlabeled samples. An example of the former appears in loan application approval, where samples with repay/default labels exist only for approved applicants and the goal is to model the repay/default behavior of all applicants. An example of the latter appears in spam filtering, in which the labeled samples can be outdated due to the cost of labeling email by hand, but an unlabeled set of up-to-date emails exists and the goal is to build a filter to sort new incoming email.

Most approaches to overcoming such bias in the literature rely on the assumption that samples are selected for labeling depending only on the features, not the labels, a case in which provably correct methods exist. The missing labels are then said to be "missing at random" (MAR). In real applications, however, the selection bias can be more severe. When the MAR conditional independence assumption is not satisfied, the missing labels are said to be "missing not at random" (MNAR), and no learning method is provably always correct.

We present a generative classifier, the shifted mixture model (SMM), with separate representations of the distributions of the labeled samples and the unlabeled samples. The SMM makes no conditional independence assumptions and can model distributions of semi-labeled data sets with arbitrary bias in the labeling. We present a learning method based on the expectation maximization (EM) algorithm that, while not always able to overcome arbitrary labeling bias, learns SMMs with higher test-set accuracy on real-world data sets (with MNAR bias) than existing learning methods that are proven to overcome MAR bias.
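
The difference between MAR and MNAR labeling can be illustrated with a small exact calculation. The sketch below is not taken from the paper: the joint distribution, the selection probabilities, and all function names are hypothetical, chosen only to show that when selection depends on the features alone, the label distribution given the features is the same among labeled samples as in the general population, whereas selection that also depends on the label distorts it.

```python
# Hypothetical joint distribution P(x, y) over a binary feature x and a
# binary label y (for instance, y = 1 could stand for "repaid the loan").
JOINT = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}


def p_y1_given_x(joint, x):
    """P(y=1 | x) in the general population."""
    return joint[(x, 1)] / (joint[(x, 0)] + joint[(x, 1)])


def p_y1_given_x_labeled(joint, select, x):
    """P(y=1 | x, s=1): the conditional observed among labeled samples only.

    select(x, y) is the (assumed) probability that a sample gets labeled.
    """
    w0 = joint[(x, 0)] * select(x, 0)
    w1 = joint[(x, 1)] * select(x, 1)
    return w1 / (w0 + w1)


def mar_select(x, y):
    """MAR: selection depends on the feature x only, never on the label y."""
    return 0.9 if x == 1 else 0.2


def mnar_select(x, y):
    """MNAR: selection depends on the label y as well as the feature x."""
    base = 0.9 if x == 1 else 0.2
    return base * (1.0 if y == 1 else 0.5)


if __name__ == "__main__":
    for x in (0, 1):
        print(
            f"x={x}: population={p_y1_given_x(JOINT, x):.3f} "
            f"MAR-labeled={p_y1_given_x_labeled(JOINT, mar_select, x):.3f} "
            f"MNAR-labeled={p_y1_given_x_labeled(JOINT, mnar_select, x):.3f}"
        )
```

In this toy setting a classifier fit to the MAR-labeled subset sees the correct P(y | x), while one fit to the MNAR-labeled subset overestimates P(y=1 | x) for both feature values, which is the situation the abstract describes as having no provably correct remedy.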



