ABSTRACT
We propose a formal model of Cross-Language Information Retrieval (CLIR) that relies on neither query translation nor document translation. Our approach leverages recent advances in language modeling to directly estimate an accurate topic model in the target language, starting from a query in the source language. The model integrates popular techniques of disambiguation and query expansion in a unified formal framework. We describe how the topic model can be estimated with either a parallel corpus or a dictionary. We test the framework by constructing Chinese topic models from English queries and using them in the CLIR task of TREC-9. In terms of average precision, the model achieves around 95% of the performance of a strong monolingual baseline. In initial precision, it outperforms the monolingual baseline by 20%. The main contribution of this work is the unified formal model, which integrates techniques that are essential for effective Cross-Language Retrieval.