ABSTRACT
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based co-occurrence statistics of word variants to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that employ only morphological rules.
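The general idea of letting corpus statistics refine a stemmer can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix list in `crude_stem`, the document-level co-occurrence window, the score `n_ab / (n_a + n_b)`, and the `threshold` value are all assumptions chosen for illustration. Words that an aggressive stemmer would conflate stay in the same class only if they actually co-occur in the corpus.

```python
from collections import defaultdict
from itertools import combinations


def crude_stem(word, suffixes=("ings", "ing", "es", "ed", "s")):
    """Hypothetical aggressive stemmer: strip the first matching suffix."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word


def refine_classes(documents, threshold=0.2):
    """Keep two variants in one class only if they co-occur often enough."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    df = defaultdict(int)  # documents containing each word
    co = defaultdict(int)  # documents containing both words of a pair
    for words in doc_sets:
        for w in words:
            df[w] += 1
        for a, b in combinations(sorted(words), 2):
            if crude_stem(a) == crude_stem(b):
                co[(a, b)] += 1

    # Union-find over words: linked variants end up in one component.
    parent = {w: w for w in df}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for (a, b), n_ab in co.items():
        # Assumed association score; it ranges from 0 to 0.5.
        if n_ab / (df[a] + df[b]) >= threshold:
            parent[find(a)] = find(b)

    classes = defaultdict(set)
    for w in df:
        classes[find(w)].add(w)
    return sorted(sorted(c) for c in classes.values() if len(c) > 1)


docs = [
    "stocks fell as stock markets closed",
    "the stock exchange lists many stocks",
    "evening news at nine",
    "a new bridge opened",
]
print(refine_classes(docs))
```

In this toy corpus, "stock" and "stocks" appear together and remain conflated, while "new" and "news", which the crude stemmer would merge, never co-occur and are kept apart.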