Resource selection for domain-specific cross-lingual IR
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Sheffield, United Kingdom
Session: Cross-language information retrieval
Pages: 154 - 161  
Year of Publication: 2004
ISBN: 1-58113-881-4
Authors
Monica Rogati  Carnegie Mellon University, Pittsburgh, PA
Yiming Yang  Carnegie Mellon University, Pittsburgh, PA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1008992.1009021

ABSTRACT

An under-explored question in cross-language information retrieval (CLIR) is to what degree the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods, each with different training corpora, on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN) instead of a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We begin exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. Using cosine similarity between training and target corpora as resource weights, we obtained an average improvement of 5.6% over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
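
The weighting criterion described in the abstract can be illustrated with a small sketch. The abstract does not specify how corpora are represented or how the similarity scores are turned into weights, so everything below is an assumption made for illustration: corpora are reduced to raw term-frequency vectors over whitespace tokens, and the cosine scores are normalized to sum to 1. The function names (term_vector, cosine, resource_weights) are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def term_vector(docs):
    """Build a raw term-frequency vector for a corpus given as a list of strings.

    Assumption: lowercase whitespace tokenization; the paper's actual
    preprocessing is not described in the abstract.
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def resource_weights(resources, target_docs):
    """Weight each training resource by its similarity to the target corpus.

    resources maps a resource name to its list of documents.
    Assumption: weights are the cosine scores normalized to sum to 1,
    falling back to uniform weights when every score is zero.
    """
    target = term_vector(target_docs)
    sims = {name: cosine(term_vector(docs), target)
            for name, docs in resources.items()}
    total = sum(sims.values())
    if total == 0:
        return {name: 1.0 / len(sims) for name in sims}
    return {name: s / total for name, s in sims.items()}
```

Under these assumptions, a training resource that shares more vocabulary with the target collection (for example, a medical parallel corpus when the target is medical text) receives a larger weight than a general-purpose one. Selecting only the highest-weighted resource would approximate the single-best-resource setting that the abstract compares against an oracle.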



