Automatic text categorization in terms of genre and author
Source: Computational Linguistics
Volume 26, Issue 4 (December 2000)
Pages: 471-495
Year of Publication: 2000
ISSN: 0891-2017
Authors
Efstathios Stamatatos  University of Patras
George Kokkinakis  University of Patras
Nikos Fakotakis  University of Patras
Publisher
MIT Press  Cambridge, MA, USA
DOI: 10.1162/089120100750105920
ABSTRACT

The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to take full advantage of existing natural language processing (NLP) tools. To this end, we propose a set of style markers including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost. We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method performs better than the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. All the presented experiments are based on unrestricted text downloaded from the World Wide Web without any manual text preprocessing or text sampling. Various performance issues regarding the training set size and the significance of the proposed style markers are discussed. Our system can be used in any application that requires fast and easily adaptable text categorization in terms of stylistically homogeneous categories. Moreover, the procedure of defining analysis-level markers can be followed in order to extract useful stylistic information using existing text processing tools.
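
For orientation, the baseline that the abstract compares against (functions of vocabulary richness and frequencies of occurrence of the most frequent words) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the regex tokenizer, and the particular richness measures chosen (type-token ratio, hapax ratio, Yule's K) are assumptions made here, and the paper's own analysis-level style markers are not reproduced.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; \\w is Unicode-aware, so Greek text works too.
    (Hypothetical tokenizer, chosen for brevity.)"""
    return re.findall(r"\w+", text.lower())


def vocabulary_richness(tokens):
    """Return a few standard vocabulary richness measures for a token list."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot compute richness of an empty text")
    counts = Counter(tokens)            # word type -> frequency
    v = len(counts)                     # vocabulary size (number of types)
    spectrum = Counter(counts.values()) # i -> number of types occurring i times
    hapax = spectrum.get(1, 0)          # types occurring exactly once
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    return {
        "type_token_ratio": v / n,
        "hapax_ratio": hapax / v,
        "yule_k": yule_k,
    }


def frequent_word_profile(tokens, reference_words):
    """Relative frequency of each reference word (e.g. the most frequent
    words of a training corpus) in the given token list."""
    n = len(tokens)
    counts = Counter(tokens)
    return [counts[w] / n for w in reference_words]


if __name__ == "__main__":
    toks = tokenize("the cat and the dog and the bird")
    print(vocabulary_richness(toks))
    print(frequent_word_profile(toks, ["the", "and", "of"]))
```

Feature vectors of this kind would then be passed to a classifier of the reader's choice; which classifier and which reference word list to use are left open here.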

