ABSTRACT
Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, which involves splitting compound words, would be a more appropriate normalization approach than straightforward stemming. The relevance of the documents was evaluated on a four-point relevance assessment scale, which was collapsed into a binary scale in two ways: by treating all relevant documents as relevant, and by treating only the highly relevant documents as relevant. Experiments with four hierarchical clustering methods supported the hypothesis. Under the stringent relevance scale, lemmatization allowed the single and complete linkage methods to recover the highly relevant documents, in particular, better than stemming did. Compared with stemming, lemmatization combined with the average linkage and Ward's methods produced higher precision. We conclude that lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval.
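The collapsing of the four-point relevance scale into two binary scales can be illustrated with a minimal sketch. The numeric coding (0 = irrelevant up to 3 = highly relevant), the function names, and the sample grades are assumptions made here for illustration; they are not taken from the study.

```python
# Illustrative sketch only: the 0..3 coding of the four-point scale and the
# sample grades below are assumptions, not data from the study.

def binarize(grades, threshold):
    """Map graded relevance to 0/1: relevant if grade >= threshold."""
    return [1 if g >= threshold else 0 for g in grades]

def precision(binary):
    """Fraction of documents in a cluster that count as relevant."""
    return sum(binary) / len(binary) if binary else 0.0

# Hypothetical relevance grades of the documents in one cluster.
cluster_grades = [3, 2, 0, 1, 3, 0]

# Liberal scale: every document with any degree of relevance counts.
liberal = binarize(cluster_grades, threshold=1)
# Stringent scale: only the highly relevant documents count.
stringent = binarize(cluster_grades, threshold=3)

print("liberal precision:  ", precision(liberal))
print("stringent precision:", precision(stringent))
```

The same cluster scores differently under the two scales, which is why results can differ between the liberal and the stringent evaluation.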