Learning classifiers from only positive and unlabeled data
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 213-220  
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Charles Elkan, University of California, San Diego, La Jolla, CA, USA
Keith Noto, University of California, San Diego, La Jolla, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401920

ABSTRACT

The input to an algorithm that learns a binary classifier normally consists of two sets of examples, where one set consists of positive examples of the concept to be learned, and the other set consists of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples, and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved in this paper is how to learn a standard binary classifier given a nontraditional training set of this nature.

Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in two different ways to learn a classifier from a nontraditional training set. We then apply these two new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Our experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples.
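
The constant-factor claim can be checked on synthetic data. The sketch below is an illustration only, not the authors' code or experiment: the notation (y for the true class, s for "was labeled", c for the fraction of positives that get labeled), the three-valued feature, the probabilities, and the function name simulate_pu are all assumptions made for this example. It labels each positive example independently with probability c, estimates the probability of being labeled for each feature value by simple counting, and compares that estimate with the true probability of being positive.

```python
import random


def simulate_pu(n=200_000, c=0.4, seed=0):
    """Simulate positive-and-unlabeled data over a discrete feature x.

    Returns (true_pos_prob, g) where true_pos_prob[x] is the assumed
    p(y=1 | x) and g[x] is the empirical frequency of s=1 given x.
    """
    rng = random.Random(seed)
    # Assumed true conditional probabilities of being positive.
    true_pos_prob = {0: 0.2, 1: 0.5, 2: 0.9}
    # x -> [number of examples, number of labeled examples]
    counts = {x: [0, 0] for x in true_pos_prob}
    for _ in range(n):
        x = rng.choice([0, 1, 2])
        y = rng.random() < true_pos_prob[x]
        # Only positives can be labeled, each with probability c,
        # independently of x (the random-selection assumption).
        s = y and (rng.random() < c)
        counts[x][0] += 1
        counts[x][1] += int(s)
    g = {x: labeled / total for x, (total, labeled) in counts.items()}
    return true_pos_prob, g


true_pos_prob, g = simulate_pu()
# Ratio of "probability of being labeled" to "probability of being positive".
ratios = {x: g[x] / true_pos_prob[x] for x in g}
for x in sorted(ratios):
    print(x, round(g[x], 3), round(ratios[x], 3))
```

In this setup the printed ratio should be close to the chosen c for every feature value, so dividing the labeled-versus-unlabeled estimate by c recovers the probability of being positive. In practice c is unknown and has to be estimated from data, which this sketch does not attempt.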



