Stemming and lemmatization in the clustering of Finnish text documents
Source: Conference on Information and Knowledge Management
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Washington, D.C., USA
SESSION: IR-7 (information retrieval): natural language processing for IR
Pages: 625-633
Year of Publication: 2004
ISBN: 1-58113-874-1
Authors
Tuomo Korenius, University of Tampere, Finland
Jorma Laurikkala, University of Tampere, Finland
Kalervo Järvelin, University of Tampere, Finland
Martti Juhola, University of Tampere, Finland
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8, Downloads (12 Months): 65, Citation Count: 5
DOI: http://doi.acm.org/10.1145/1031171.1031285

ABSTRACT

Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, which involves splitting compound words, would be a more appropriate normalization approach than straightforward stemming. The relevance of the documents was evaluated on a four-point relevance assessment scale, which was collapsed into a binary scale in two ways: by treating all relevant documents as relevant, and by treating only the highly relevant documents as relevant (the stringent scale). Experiments with four hierarchical clustering methods supported the hypothesis. Under the stringent relevance scale, lemmatization allowed the single and complete linkage methods to recover the highly relevant documents, in particular, better than stemming did. With the average linkage and Ward's methods, lemmatization produced higher precision than stemming. We conclude that lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval.
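
The evaluation step in the abstract (collapsing a four-point relevance scale into a binary one, then scoring a cluster) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The numeric coding of the grades (0 to 3), the two thresholds, the document ids, and the helper names `collapse` and `precision_recall` are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed coding of the four-point scale (not given in the abstract):
# 0 = not relevant, 1 = marginally relevant, 2 = relevant, 3 = highly relevant


def collapse(grades: Dict[str, int], threshold: int) -> Set[str]:
    """Return ids of documents counted as relevant at the given threshold."""
    return {doc for doc, grade in grades.items() if grade >= threshold}


def precision_recall(cluster: Iterable[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of one cluster against a binary relevant set."""
    members = set(cluster)
    hits = len(members & relevant)
    precision = hits / len(members) if members else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical graded assessments and one hypothetical cluster.
grades = {"d1": 3, "d2": 2, "d3": 1, "d4": 0, "d5": 3, "d6": 1}
cluster = ["d1", "d2", "d3"]

# Assumed liberal collapse: any grade above "not relevant" counts.
liberal = collapse(grades, threshold=1)
# Assumed stringent collapse: only the highest grade counts.
stringent = collapse(grades, threshold=3)

print("liberal:", precision_recall(cluster, liberal))
print("stringent:", precision_recall(cluster, stringent))
```

The same cluster can score very differently under the two collapses, which is why a stringent scale can expose differences between normalization methods that a liberal scale hides.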


