Classification algorithms for NETNEWS articles
Source: Conference on Information and Knowledge Management
Proceedings of the Eighth International Conference on Information and Knowledge Management
Kansas City, Missouri, United States
Pages: 114-121
Year of Publication: 1999
ISBN: 1-58113-146-1
Authors
Wen-Lin Hsu, School of Computer Science, University of Central Florida, Orlando, FL
Sheau-Dong Lang, School of Computer Science, University of Central Florida, Orlando, FL
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGIR: ACM Special Interest Group on Information Retrieval
SIGMIS: ACM Special Interest Group on Management Information Systems
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/319950.319965

ABSTRACT

We propose several algorithms that use the vector space model to classify news articles posted on NETNEWS according to newsgroup categories. The baseline method combines the terms of all the articles of each newsgroup in the training set, so that each newsgroup is represented by a single vector. After training, incoming news articles are classified by their similarity to the existing newsgroup categories. We propose three techniques to improve the classification performance of the baseline method: (1) using routing (classification) accuracy and the similarity values to refine the training set; (2) updating the underlying term structures periodically during testing; and (3) applying k-means clustering to partition the articles of each newsgroup and representing each newsgroup by k vectors. Our test collection consists of real news articles posted over a period of 3 months to the 519 subnewsgroups under the REC newsgroup of NETNEWS. Our experimental results demonstrate that refining the training set reduces storage by one-third to two-thirds. Periodic updates improve routing accuracy by 20% to 100% but incur runtime overhead. Finally, representing each newsgroup by k vectors (with k = 2 or 3) obtained by clustering yields the most significant improvement in routing accuracy, ranging from 60% to 100%, at the cost of only slightly higher storage requirements.
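
As an illustration only, the baseline method and the k-vector variant described in the abstract might be sketched in Python as follows. The abstract does not specify the term weighting, the similarity measure, or how the clustering is seeded, so this sketch assumes raw term frequencies, cosine similarity, and seeding with the first k articles. All function names (`term_vector`, `cosine`, `train_baseline`, `kmeans_vectors`, `train_k_vectors`, `classify`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_vector(texts):
    """Combine the terms of one or more articles into a term-frequency vector."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())  # assumption: whitespace tokenizer
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (assumed measure)."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_baseline(training_set):
    """Baseline: one combined vector per newsgroup.

    training_set maps a newsgroup name to a list of article texts.
    """
    return {group: [term_vector(articles)]
            for group, articles in training_set.items()}


def kmeans_vectors(articles, k, iterations=10):
    """Partition one newsgroup's articles into at most k clusters.

    Assumptions: clusters are seeded with the first k articles, articles are
    assigned by cosine similarity, and each cluster is summarized by the sum
    of its members (same direction as the mean).
    """
    vectors = [term_vector([article]) for article in articles]
    centroids = vectors[:k]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            best = max(range(len(centroids)),
                       key=lambda i: cosine(v, centroids[i]))
            clusters[best].append(v)
        updated = []
        for members, old in zip(clusters, centroids):
            if members:
                total = Counter()
                for member in members:
                    total.update(member)
                updated.append(total)
            else:
                updated.append(old)  # keep the old centroid for an empty cluster
        centroids = updated
    return centroids


def train_k_vectors(training_set, k):
    """Variant: represent each newsgroup by up to k cluster vectors."""
    return {group: kmeans_vectors(articles, k)
            for group, articles in training_set.items()}


def classify(article, model):
    """Route an article to the newsgroup whose best vector is most similar."""
    query = term_vector([article])
    best_group, best_score = None, -1.0
    for group, vectors in model.items():
        score = max(cosine(query, v) for v in vectors)
        if score > best_score:
            best_group, best_score = group, score
    return best_group, best_score
```

Because a model here maps each newsgroup to a list of vectors, the same `classify` function serves both the single-vector baseline and the k-vector representation. For example, `classify("which tires fit my car", train_baseline(data))` returns the best-matching newsgroup name together with its similarity score, where `data` is a dictionary of newsgroup names to article lists.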


