ABSTRACT
The input to an algorithm that learns a binary classifier normally consists of two sets of examples, where one set consists of positive examples of the concept to be learned, and the other set consists of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples, and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved in this paper is how to learn a standard binary classifier given a nontraditional training set of this nature. Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in two different ways to learn a classifier from a nontraditional training set. We then apply these two new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Our experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples.
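The constant-factor claim can be checked numerically. Writing s = 1 when an example is labeled and y = 1 when it is truly positive (notation assumed here), only positives are ever labeled, and labeling is random among positives, so p(s=1|x) = p(s=1|y=1) p(y=1|x) = c · p(y=1|x). The sketch below is a toy simulation under that assumption, not the experimental setup of the paper: the three feature values, the class probabilities, the labeling rate c = 0.3, and the function names are all hypothetical, and a simple frequency table stands in for the classifier trained on positive and unlabeled examples.

```python
import random


def simulate(n=200_000, c=0.3, seed=0):
    """Draw (x, s) pairs where s = 1 marks a labeled positive example.

    Assumption: labeled examples are chosen at random from the positives,
    each with the same probability c, independently of x.
    """
    rng = random.Random(seed)
    # Hypothetical true conditional probabilities p(y=1 | x).
    p_pos = {0: 0.2, 1: 0.5, 2: 0.9}
    data = []
    for _ in range(n):
        x = rng.randrange(3)
        y = rng.random() < p_pos[x]        # true class, hidden from the learner
        s = y and (rng.random() < c)       # only positives can be labeled
        data.append((x, s))
    return p_pos, data


def fit_tabular(data):
    """Estimate g(x) = p(s=1 | x) by the labeled fraction at each x.

    This frequency table plays the role of a classifier trained to
    separate labeled examples from unlabeled ones.
    """
    totals, labeled = {}, {}
    for x, s in data:
        totals[x] = totals.get(x, 0) + 1
        labeled[x] = labeled.get(x, 0) + int(s)
    return {x: labeled[x] / totals[x] for x in totals}


if __name__ == "__main__":
    p_pos, data = simulate()
    g = fit_tabular(data)
    for x in sorted(p_pos):
        # The ratio g(x) / p(y=1 | x) should be close to c for every x.
        print(x, round(g[x], 3), round(g[x] / p_pos[x], 3))
```

In this toy setting the ratio is approximately the same at every feature value, so dividing the learned probabilities by c would recover p(y=1|x). The simulation takes c as known; how to estimate it from data is not shown here.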