ABSTRACT
Subject or propositional content has been the focus of most classification research. Genre or style, on the other hand, is a different and important property of text, and automatic text genre classification is becoming important for classification and retrieval purposes as well as for some natural language processing research. In this paper, we present a method for automatic genre classification that is based on statistically selected features obtained from both subject-classified and genre-classified training data. The experimental results show that the proposed method outperforms a direct application of a statistical learner often used for subject classification. We also observe that the deviation formula and the discrimination formula, both of which use document frequency ratios, work as expected. We conjecture that this dual feature set approach can be generalized to improve the performance of subject classification as well.