Statistical transliteration for English-Arabic cross language information retrieval
Source: Conference on Information and Knowledge Management
Proceedings of the twelfth international conference on Information and knowledge management
New Orleans, LA, USA
SESSION: Information retrieval session 3: cross language retrieval
Pages: 139 - 146  
Year of Publication: 2003
ISBN:1-58113-723-0
Authors
Nasreen AbdulJaleel  University of Massachusetts, Amherst, MA
Leah S. Larkey  University of Massachusetts, Amherst, MA
Sponsors
ACM: Association for Computing Machinery
SIGMIS: ACM Special Interest Group on Management Information Systems
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/956863.956890

ABSTRACT

Out-of-vocabulary (OOV) words are problematic for cross language information retrieval. One way to deal with OOV words when the two languages have different alphabets is to transliterate the unknown words, that is, to render them in the orthography of the second language. In this study, we present a simple statistical technique to train an English to Arabic transliteration model from pairs of names. We call this a selected n-gram model because a two-stage training procedure first learns which n-gram segments should be added to the unigram inventory for the source language, and then a second stage learns the translation model over this inventory. This technique requires no heuristics or linguistic knowledge of either language. We evaluate the statistically trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus and demonstrate that they perform better than two online translation sources. We also explore the effectiveness of these systems on the TREC 2002 cross language IR task. We find that transliteration, either of OOV named entities or of all OOV words, is an effective approach for cross language IR.
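
The two-stage idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the toy name pairs, the hand-written character alignments, the `min_count` threshold, the greedy longest-match segmentation, and the function names `select_ngrams`, `segment`, `train`, and `transliterate` are all assumptions made for illustration. In particular, the sketch reuses the same alignments in both stages and estimates probabilities by simple relative frequency, which may differ from how the paper trains its model.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (source, target, alignment). alignment[i] is the index
# of the target character aligned to source[i], or None if unaligned.
# These alignments are hand-written; a real system would obtain them from an aligner.
PAIRS = [
    ("shadi", "شادي", [0, 0, 1, 2, 3]),
    ("shams", "شمس", [0, 0, None, 1, 2]),
    ("sami", "سامي", [0, 1, 2, 3]),
]


def select_ngrams(pairs, min_count=2):
    """Stage 1: keep source n-grams (n >= 2) that often align to one target character."""
    counts = Counter()
    for source, _target, alignment in pairs:
        start = 0
        for i in range(1, len(source) + 1):
            # Close the current run when the aligned target index changes.
            if i == len(source) or alignment[i] != alignment[start]:
                if i - start >= 2 and alignment[start] is not None:
                    counts[source[start:i]] += 1
                start = i
    return {ngram for ngram, count in counts.items() if count >= min_count}


def segment(word, inventory):
    """Greedy longest-match segmentation; falls back to single characters."""
    longest = max((len(ngram) for ngram in inventory), default=1)
    segments, i = [], 0
    while i < len(word):
        for n in range(min(longest, len(word) - i), 0, -1):
            if n == 1 or word[i:i + n] in inventory:
                segments.append(word[i:i + n])
                i += n
                break
    return segments


def train(pairs, inventory):
    """Stage 2: estimate P(target string | source segment) by relative frequency."""
    counts = defaultdict(Counter)
    for source, target, alignment in pairs:
        i = 0
        for seg in segment(source, inventory):
            # Collect the distinct target positions aligned to this segment.
            idxs = sorted({a for a in alignment[i:i + len(seg)] if a is not None})
            counts[seg]["".join(target[a] for a in idxs)] += 1
            i += len(seg)
    return {
        seg: {tgt: c / sum(by_tgt.values()) for tgt, c in by_tgt.items()}
        for seg, by_tgt in counts.items()
    }


def transliterate(word, inventory, model):
    """Segment the word and emit the most probable target string per segment."""
    out = []
    for seg in segment(word, inventory):
        options = model.get(seg)
        if options:  # segments never seen in training are skipped
            out.append(max(options, key=options.get))
    return "".join(out)


inventory = select_ngrams(PAIRS)
model = train(PAIRS, inventory)
print(sorted(inventory), transliterate("shami", inventory, model))
```

With this toy data, stage 1 adds the digraph "sh" to the source inventory because it aligns to a single Arabic letter in two training names, so an unseen name such as "shami" is segmented as sh, a, m, i before each segment is mapped to its most frequent target.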


