ABSTRACT
We propose several algorithms that use the vector space model to classify news articles posted on NETNEWS according to newsgroup categories. The baseline method combines the terms of all the training-set articles of each newsgroup, so that each newsgroup is represented by a single vector. After training, incoming news articles are classified by their similarity to the existing newsgroup categories. We propose three techniques to improve the classification performance of the baseline method: (1) use routing (classification) accuracy and the similarity values to refine the training set; (2) update the underlying term structures periodically during testing; and (3) apply k-means clustering to partition the articles of each newsgroup and represent each newsgroup by k vectors. Our test collection consists of real news articles posted over a period of 3 months to the 519 subnewsgroups under the REC newsgroup of NETNEWS. Our experimental results show that refining the training set reduces storage by one-third to two-thirds. Periodic updates improve routing accuracy by 20% to 100% but incur runtime overhead. Finally, representing each newsgroup by k vectors (with k = 2 or 3) obtained through clustering yields the most significant improvement in routing accuracy, ranging from 60% to 100%, at the cost of only slightly higher storage requirements.
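The baseline method and the k-vector representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: raw term frequencies, whitespace tokenization, cosine similarity, the deterministic choice of initial centroids, and all function names (`term_vector`, `train_baseline`, `kmeans_vectors`, `classify`) are assumptions made for illustration, and the abstract does not specify the term weighting or similarity measure actually used.

```python
import math
from collections import Counter


def term_vector(texts):
    """Combine the terms of one or more articles into one term-frequency vector.
    Assumption: lowercase whitespace tokens with raw counts as weights."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_baseline(training_set):
    """Baseline: each newsgroup is represented by a single combined vector."""
    return {group: [term_vector(articles)]
            for group, articles in training_set.items()}


def kmeans_vectors(articles, k, iterations=10):
    """Represent one newsgroup by up to k vectors using a simple k-means loop.
    Assumptions: cosine similarity as the closeness measure and the first k
    articles as initial centroids (chosen only to keep the sketch deterministic)."""
    vectors = [term_vector([a]) for a in articles]
    centroids = vectors[:k]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            best = max(range(len(centroids)),
                       key=lambda i: cosine(v, centroids[i]))
            clusters[best].append(v)
        # Summed vectors stand in for means: cosine ignores vector length.
        centroids = [sum(c, Counter()) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def classify(article, model):
    """Route an article to the newsgroup with the most similar vector."""
    v = term_vector([article])
    return max(model, key=lambda g: max(cosine(v, c) for c in model[g]))


# Toy training set (hypothetical articles, not from the REC collection).
training = {
    "rec.autos": ["new engine and tires for my car", "car engine oil change"],
    "rec.music": ["guitar chords for the song", "new album song review"],
}
baseline_model = train_baseline(training)
k_vector_model = {g: kmeans_vectors(a, 2) for g, a in training.items()}
```

Because both models map a newsgroup to a list of vectors, the same `classify` function serves the single-vector baseline and the k-vector variant; for example, `classify("which oil for my engine", baseline_model)` routes the toy article to `rec.autos`.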
CITED BY 7
Souneil Park, Seungwoo Kang, Sangjeong Lee, Sangyoung Chung, Junehwa Song, Mitigating media bias: a computational approach, Proceedings of the Hypertext 2008 workshop on Collaboration and collective intelligence, June 19-21, 2008, Pittsburgh, PA, USA