Text classification from positive and unlabeled documents
Source: Conference on Information and Knowledge Management
Proceedings of the Twelfth International Conference on Information and Knowledge Management
New Orleans, LA, USA
Session: Knowledge management session 3: classification
Pages: 232-239
Year of Publication: 2003
ISBN:1-58113-723-0
Authors
Hwanjo Yu, University of Illinois, IL
ChengXiang Zhai, University of Illinois, IL
Jiawei Han, University of Illinois, IL
Sponsors
ACM: Association for Computing Machinery
SIGMIS: ACM Special Interest Group on Management Information Systems
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/956863.956909

ABSTRACT

Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems can be more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study this problem of performing Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) [17], for TC-WON tasks. Our analyses show that when the positive training data are not too under-sampled, SVMC significantly outperforms other methods because it exploits the natural "gap" between positive and negative documents in the feature space, which in turn improves generalization performance. In the text domain, many such gaps are likely to exist because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the boundary of SVMC starts overfitting at some point and ends up generating very poor results. This is because, when the positive training examples are too few, the boundary over-iterates and trespasses the natural gaps between the positive and negative classes in the feature space, and thus ends up fitting tightly around the few positive training examples.
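
The iterative behavior described in the abstract can be illustrated with a toy sketch. This is not the authors' SVMC implementation: it works on one-dimensional points, assumes every negative lies to the right of every positive, and uses a midpoint threshold (the max-margin boundary in that restricted 1-D case) as a stand-in for the SVM. The function name `mapping_convergence_1d`, the `n_seed` parameter, and the seeding heuristic are all hypothetical choices made for illustration.

```python
def mapping_convergence_1d(positives, unlabeled, n_seed=2):
    """Toy mapping-convergence loop on 1-D points (illustrative only).

    Assumes all true negatives lie to the right of all positives, so a
    single threshold can separate the two classes.
    """
    if n_seed < 1:
        raise ValueError("n_seed must be at least 1")

    max_pos = max(positives)
    remaining = sorted(unlabeled)

    # Mapping stage (assumed heuristic): seed the negative set with the
    # n_seed unlabeled points that lie farthest to the right.
    negatives = remaining[-n_seed:]
    remaining = remaining[:-n_seed]

    while True:
        # 1-D max-margin boundary: midpoint between the largest positive
        # and the smallest point currently treated as negative.
        threshold = (max_pos + min(negatives)) / 2.0
        newly_negative = [x for x in remaining if x > threshold]
        if not newly_negative:
            break  # converged: nothing unlabeled is on the negative side
        negatives.extend(newly_negative)
        remaining = [x for x in remaining if x <= threshold]

    return threshold, remaining, sorted(negatives)


# Labeled positives, plus unlabeled data mixing hidden positives
# (0.2, 0.8, 1.2) with negatives (5.0 and above).
labeled_pos = [0.0, 0.5, 1.0]
unlabeled = [0.2, 0.8, 1.2, 5.0, 6.0, 7.0, 8.0, 9.0]

threshold, predicted_pos, predicted_neg = mapping_convergence_1d(
    labeled_pos, unlabeled
)
print(threshold, predicted_pos, predicted_neg)
```

In this toy setting the loop stops once the threshold lands in the empty region between 1.2 and 5.0. With evenly spaced unlabeled points and no such gap, the same loop would keep moving the threshold toward the labeled positives, which loosely mirrors the over-iteration the abstract warns about.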


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

[11] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083-1121, 2000.
[14] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
[17] H. Yu. SVMC: Single-class classification with support vector machines. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, 2003.
