A parallel learning algorithm for text classification
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
SESSION: Text classification
Pages: 201 - 206  
Year of Publication: 2002
ISBN:1-58113-567-X
Authors
Canasai Kruengkrai  Kasetsart University, Bangkok, Thailand
Chuleerat Jaruskulchai  Kasetsart University, Bangkok, Thailand
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
AAAI
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 74,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/775047.775077

ABSTRACT

Text classification is the process of assigning documents to predefined categories based on their content. Existing supervised learning algorithms for automatic text classification need a sufficient number of labeled documents to learn accurately. Applying the Expectation-Maximization (EM) algorithm to this problem is an alternative approach that uses a large pool of unlabeled documents to augment the available labeled documents. Unfortunately, the time needed to learn from such a large pool of unlabeled documents is too high. This paper introduces a novel parallel learning algorithm for the text classification task. The parallel algorithm combines the EM algorithm with the naive Bayes classifier. Our goal is to reduce the computation time of the learning and classification processes. We studied the performance of our parallel algorithm on a large Linux PC cluster called PIRUN Cluster. We report both timing and accuracy results. These results indicate that the proposed parallel algorithm is capable of handling large document collections.
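
The abstract does not say how the authors split the work across cluster nodes, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a multinomial naive Bayes model with Laplace smoothing, and it relies on the fact that the class-weighted word counts needed by the M-step are additive over document partitions, so each partition can be counted independently and the partial counts summed. The category names, the toy documents, and every function name here are hypothetical. Python threads stand in for cluster nodes purely to show the data flow; they give no real speed-up for this CPU-bound work.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CLASSES = ["sports", "tech"]  # hypothetical categories for illustration


def partial_counts(chunk):
    """Per-class document mass and word mass for one partition of the data."""
    doc_mass = {c: 0.0 for c in CLASSES}
    word_mass = {c: Counter() for c in CLASSES}
    for tokens, posterior in chunk:
        for c in CLASSES:
            doc_mass[c] += posterior[c]
            for w in tokens:
                word_mass[c][w] += posterior[c]
    return doc_mass, word_mass


def m_step(weighted_docs, vocab, workers=2):
    """Re-estimate priors and word probabilities from summed partial counts."""
    chunks = [weighted_docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_counts, chunks))
    doc_mass = {c: sum(p[0][c] for p in parts) for c in CLASSES}
    word_mass = {c: sum((p[1][c] for p in parts), Counter()) for c in CLASSES}
    n = len(weighted_docs)
    # Laplace smoothing keeps every probability strictly positive.
    prior = {c: (1 + doc_mass[c]) / (len(CLASSES) + n) for c in CLASSES}
    cond = {}
    for c in CLASSES:
        total = sum(word_mass[c].values())
        cond[c] = {w: (1 + word_mass[c][w]) / (len(vocab) + total) for w in vocab}
    return prior, cond


def e_step(tokens, prior, cond):
    """P(class | document) under the current model; tokens must be in the vocabulary."""
    log_p = {
        c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in tokens)
        for c in CLASSES
    }
    top = max(log_p.values())  # subtract the max to avoid underflow
    scaled = {c: math.exp(v - top) for c, v in log_p.items()}
    z = sum(scaled.values())
    return {c: v / z for c, v in scaled.items()}


def em_naive_bayes(labeled, unlabeled, iterations=5):
    """Train on labeled docs, then refine with soft labels on unlabeled docs."""
    vocab = sorted({w for t, _ in labeled for w in t} | {w for t in unlabeled for w in t})
    fixed = [(t, {c: 1.0 if c == y else 0.0 for c in CLASSES}) for t, y in labeled]
    prior, cond = m_step(fixed, vocab)  # initial model from labeled data only
    for _ in range(iterations):
        guessed = [(t, e_step(t, prior, cond)) for t in unlabeled]
        prior, cond = m_step(fixed + guessed, vocab)
    return prior, cond


# Toy data: "league" and "kernel" never appear in a labeled document.
labeled = [(["goal", "match", "team"], "sports"), (["cpu", "code", "chip"], "tech")]
unlabeled = [["team", "league", "goal"], ["chip", "kernel", "code"],
             ["league", "match"], ["kernel", "cpu"]]
prior, cond = em_naive_bayes(labeled, unlabeled)
posterior = e_step(["league"], prior, cond)
```

In this toy run the word "league" occurs only in unlabeled documents, so any evidence the model has about it comes from the EM iterations rather than from the labeled set. On a real cluster the per-partition counting would run on separate nodes and the summation would be a reduction step; that mapping is an assumption of this sketch.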



