YASS: Yet another suffix stripper
Source
ACM Transactions on Information Systems (TOIS)
Volume 25, Issue 4 (October 2007)
Article No. 18
Year of Publication: 2007
ISSN: 1046-8188
Authors
Prasenjit Majumder, Indian Statistical Institute, Kolkata, India
Mandar Mitra, Indian Statistical Institute, Kolkata, India
Swapan K. Parui, Indian Statistical Institute, Kolkata, India
Gobinda Kole, Indian Statistical Institute, Kolkata, India
Pabitra Mitra, Indian Institute of Technology, Kharagpur, India
Kalyankumar Datta, Jadavpur University, Calcutta, India
Bibliometrics
Downloads (6 Weeks): 12, Downloads (12 Months): 125, Citation Count: 4
ABSTRACT
Stemmers attempt to reduce a word to its stem or root form and are used widely in information retrieval tasks to increase the recall rate. Most popular stemmers encode a large number of language-specific rules built over a length of time. Such stemmers with comprehensive rules are available only for a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing tools have been successfully used to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures is defined, and the lexicon for a given text collection is clustered using the distance measures to identify these equivalence classes. The proposed approach is compared with Porter's and Lovins' stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.
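The clustering idea in the abstract can be illustrated with a minimal sketch. This record does not reproduce the article's distance measures or its clustering procedure, so everything here is an assumption made for illustration: `prefix_distance` is a hypothetical measure that is small when two words share a long common prefix, `cluster_lexicon` is a plain complete-linkage agglomerative loop, and the `threshold` value is arbitrary.

```python
def prefix_distance(a: str, b: str) -> float:
    """Hypothetical distance: unmatched tail length divided by the
    common-prefix length. Not the measure defined in the article."""
    n = max(len(a), len(b))
    m = 0  # length of the common prefix
    for x, y in zip(a, b):
        if x != y:
            break
        m += 1
    if m == 0:
        return float("inf")  # no shared prefix: never merged
    return (n - m) / m


def cluster_lexicon(words, threshold):
    """Complete-linkage agglomerative clustering of a word list.

    Two clusters are merged only while the largest pairwise distance
    between their members stays at or below `threshold`.
    """
    clusters = [[w] for w in sorted(set(words))]
    while True:
        best = None  # (distance, i, j) of the closest mergeable pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(prefix_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]


if __name__ == "__main__":
    lexicon = ["astronomer", "astronomy", "astronomical",
               "connect", "connected", "connecting", "connection"]
    for group in cluster_lexicon(lexicon, threshold=0.6):
        print(group)
```

Each resulting cluster plays the role of an equivalence class: every member could be mapped to one representative, such as the shortest word, at indexing and query time. The quadratic search over cluster pairs is adequate for a toy lexicon only; a real collection would need a more efficient clustering implementation.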
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
1. Adamson, G. and Boreham, J. 1974. The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Inf. Stor. Retrieval 10, 253--260.
3. Buckley, C., Singhal, A., and Mitra, M. 1996. Using query zoning and correlation within SMART: TREC 5. In the 5th Text Retrieval Conference.
4. Buckley, C., Singhal, A., and Mitra, M. 1995. New retrieval approaches using SMART: TREC 4. In the 4th Text Retrieval Conference.
8. Hsu, J. 1986. Multiple Comparisons: Theory and Methods. Chapman and Hall.
12. Levenstein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. Commun. ACM 27, 4, 358--368.
13. Lovins, J. 1968. Development of a stemming algorithm. Mech. Trans. Comput. Linguis. 11, 22--31.
14. Majumder, P., Mitra, M., and Chaudhuri, B. 2004. Construction and statistical analysis of an Indic language corpus for applied language research. Computing Science Tech. Rep. TR/ISI/CVPR/01/2004, CVPR Unit, Indian Statistical Institute, Kolkata.
16. Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.
17. Ramanathan, A. and Rao, D. 2003. A lightweight stemmer for Hindi. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Workshop on Computational Linguistics for South Asian Languages (Budapest, Apr.).
CITED BY 4
Dipasree Pal, Prasenjit Majumder, Mandar Mitra, Sukanya Mitra, Aparajita Sen, Issues in searching for Indian language web content, Proceeding of the 2nd ACM workshop on Improving non english web searching, October 30, 2008, Napa Valley, California, USA