Improving text categorization bootstrapping via unsupervised learning
Source
ACM Transactions on Speech and Language Processing (TSLP)
Volume 6, Issue 1 (October 2009)
Article No. 1
Year of Publication: 2009
ISSN:1550-4875
Authors
Alfio Gliozzo  STLab-ISTC-CNR, Rome
Carlo Strapparava  FBK-IRST, Povo
Ido Dagan  Bar Ilan University
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1596515.1596516

ABSTRACT

We propose a text-categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) using latent semantic spaces to estimate the similarity among documents and words, and (ii) applying the Gaussian mixture algorithm, which differentiates relevant and nonrelevant category information using statistics from unlabeled examples. This second step maps the similarity scores to class posterior probabilities, and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks, and obtained good performance using only the category names as initial seeds. In particular, the performance of the proposed method proved to be equivalent to that of a purely supervised approach trained on 70 to 160 labeled documents per category.
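
The second technique, mapping raw similarity scores to class posterior probabilities, can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a two-component, one-dimensional Gaussian mixture fitted by EM to the similarity scores of unlabeled documents for a single category, and it treats the component with the higher mean as the "relevant" one. The function names, the min/max initialization, and the iteration count are all hypothetical choices made for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(scores, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture to similarity scores with EM.

    Returns (weights, means, variances). Index 1 is the component with the
    higher mean, which this sketch assumes models the relevant documents.
    """
    n = len(scores)
    overall_mean = sum(scores) / n
    overall_var = max(sum((s - overall_mean) ** 2 for s in scores) / n, min_var)
    # Assumed initialization: one component at each extreme of the scores.
    means = [min(scores), max(scores)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            joint = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in (0, 1)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            mass = sum(r[k] for r in resp)
            if mass < 1e-12:
                continue  # component received no data; keep its old parameters
            weights[k] = mass / n
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / mass
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / mass
            variances[k] = max(var, min_var)

    # Keep the higher-mean component at index 1.
    if means[0] > means[1]:
        weights.reverse()
        means.reverse()
        variances.reverse()
    return weights, means, variances


def relevance_posterior(score, weights, means, variances):
    """Posterior probability that a score came from the higher-mean component."""
    joint = [weights[k] * gaussian_pdf(score, means[k], variances[k]) for k in (0, 1)]
    total = sum(joint)
    if total == 0.0:
        # Both densities underflowed; fall back to the nearer component mean.
        return 1.0 if abs(score - means[1]) < abs(score - means[0]) else 0.0
    return joint[1] / total


# Made-up similarity scores for one category: mostly low, a few high.
example_scores = [0.05, 0.08, 0.10, 0.12, 0.07, 0.09, 0.11, 0.06, 0.80, 0.85, 0.90, 0.78]
params = fit_two_gaussians(example_scores)
print(relevance_posterior(0.85, *params), relevance_posterior(0.08, *params))
```

Under these assumptions the mixture would be fitted separately for each category, so a posterior of, say, 0.9 would carry the same meaning whether the category's seed words tend to produce high or low raw similarity scores. That is one way to read the abstract's claim about reduced sensitivity to keyword-dependent score variations.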

