ABSTRACT
An under-explored question in cross-language information retrieval (CLIR) is the degree to which the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods, trained on different corpora, on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN) instead of a domain-specific training corpus. A related unexplored question is whether CLIR performance can be improved by systematically analyzing training resources and optimally matching them to target collections. We begin exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. Using the cosine similarity between training and target corpora as resource weights, we obtain an average improvement of 5.6% over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
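As an illustration of the matching criterion, the following is a minimal sketch of cosine-similarity resource weighting. It is not the implementation used in the paper: the whitespace tokenization, the raw term-frequency vectors, the normalization of weights to sum to one, and the names `term_vector`, `cosine`, and `resource_weights` are all assumptions made for the example.

```python
import math
from collections import Counter


def term_vector(docs):
    """Build a bag-of-words term-frequency vector for a corpus.

    Assumption: lowercased whitespace tokens; the abstract does not
    specify how corpora are represented.
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def resource_weights(resources, target_docs):
    """Weight each training resource by its similarity to the target corpus.

    Assumption: weights are normalized to sum to 1; if no resource shares
    any term with the target, fall back to uniform weights.
    """
    target = term_vector(target_docs)
    sims = {name: cosine(term_vector(docs), target)
            for name, docs in resources.items()}
    total = sum(sims.values())
    if total == 0:
        return {name: 1.0 / len(sims) for name in sims}
    return {name: s / total for name, s in sims.items()}


# Hypothetical toy corpora, for illustration only.
resources = {
    "medical": ["patient dose therapy", "clinical therapy trial"],
    "news": ["election vote parliament"],
}
target_docs = ["therapy for patient", "clinical dose"]
weights = resource_weights(resources, target_docs)
```

In this toy setup the medical resource shares vocabulary with the target collection and the news resource does not, so the medical resource receives the larger weight. How such weights are then combined with the translation resources in a retrieval system is left open here, since the abstract does not describe it.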