First large-scale information retrieval experiments on Turkish texts
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
POSTER SESSION: Posters
Pages: 627-628
Year of Publication: 2006
ISBN: 1-59593-369-7
Authors
Fazli Can, Bilkent University, Bilkent, Turkey
Seyit Kocberber, Bilkent University, Bilkent, Turkey
Erman Balcik, Bilkent University, Bilkent, Turkey
Cihan Kaynak, Bilkent University, Bilkent, Turkey
H. Cagdas Ocalan, Bilkent University, Bilkent, Turkey
Onur M. Vursavas, Bilkent University, Bilkent, Turkey
ABSTRACT
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, created for this study, contains 95.5 million words, 408,305 documents, and 72 ad hoc queries, and is about 800 MB in size. All documents come from the Turkish newspaper Milliyet. We implement and apply stemmers ranging from simple to sophisticated, together with various query-document matching functions, and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
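The prefix-truncation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`turkish_lower`, `truncate_stem`, `index_terms`), the regex tokenizer, and the Turkish-aware lowercasing step are all assumptions made for illustration. Only the prefix length of 5 comes from the abstract.

```python
import re


def turkish_lower(text: str) -> str:
    # Assumption: map dotted/dotless capital I by hand, because Python's
    # default str.lower() does not apply Turkish casing rules.
    return text.replace("İ", "i").replace("I", "ı").lower()


def truncate_stem(word: str, prefix_len: int = 5) -> str:
    # Keep only the first prefix_len characters of the lowercased word.
    # Words shorter than prefix_len are returned unchanged.
    return turkish_lower(word)[:prefix_len]


def index_terms(text: str, prefix_len: int = 5) -> list[str]:
    # Hypothetical tokenizer: split on runs of Unicode word characters.
    return [truncate_stem(tok, prefix_len) for tok in re.findall(r"\w+", text)]


if __name__ == "__main__":
    # "kitaplar" (books) and "kitaplarımız" (our books) share the prefix "kitap".
    print(index_terms("Kitaplar ve kitaplarımız"))
```

Because Turkish is agglutinative and builds words by appending suffixes, inflected forms of a word often share their leading characters, so a fixed-length prefix can conflate them without any morphological analysis. A lemmatizer-based stemmer, which the abstract reports as more effective, would need a lexicon or morphological analyzer and is not sketched here.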
CITED BY
Ismail Sengor Altingovde, Rifat Ozcan, Huseyin Cagdas Ocalan, Fazli Can, Özgür Ulusoy. Large-scale cluster-based retrieval experiments on Turkish texts. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands.