Lexical triggers and latent semantic analysis for cross-lingual language model adaptation
Source: ACM Transactions on Asian Language Information Processing (TALIP)
Volume 3, Issue 2 (June 2004)
Pages: 94 - 112  
Year of Publication: 2004
ISSN:1530-0226
Authors
Woosung Kim  The Johns Hopkins University, Baltimore, MD
Sanjeev Khudanpur  The Johns Hopkins University, Baltimore, MD
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1034780.1034782

ABSTRACT

In-domain texts for estimating statistical language models are not easily found for most languages of the world. We present two techniques to take advantage of in-domain text resources in other languages. First, we extend the notion of lexical triggers, which have been used monolingually for language model adaptation, to the cross-lingual problem, permitting the construction of sharper language models for a target-language document by drawing statistics from related documents in a resource-rich language. Next, we show that cross-lingual latent semantic analysis is similarly capable of extracting useful statistics for language modeling. Neither technique requires explicit translation capabilities between the two languages! We demonstrate significant reductions in both perplexity and word error rate on a Mandarin speech recognition task by using these techniques.
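
The abstract does not say how cross-lingual trigger pairs are selected. As a rough illustration only, the Python sketch below assumes a collection of document-aligned pairs (one resource-rich-language document matched to one target-language document) and scores each word pair by the mutual information of their document-level co-occurrence, a criterion often used for monolingual triggers. The function name `trigger_scores`, the romanized tokens, and the toy data are hypothetical and are not taken from the article.

```python
import math
from collections import Counter
from itertools import product


def trigger_scores(doc_pairs):
    """Score (source_word, target_word) pairs by the mutual information of
    the events "source_word occurs in the source document" and
    "target_word occurs in the aligned target document".

    Assumption: doc_pairs is a list of (source_tokens, target_tokens)
    tuples; this is an illustrative criterion, not the article's method.
    """
    n = len(doc_pairs)
    src_df, tgt_df, joint_df = Counter(), Counter(), Counter()
    for src_doc, tgt_doc in doc_pairs:
        src_set, tgt_set = set(src_doc), set(tgt_doc)
        src_df.update(src_set)                       # document frequency, source side
        tgt_df.update(tgt_set)                       # document frequency, target side
        joint_df.update(product(src_set, tgt_set))   # cross-lingual co-occurrence

    def term(n_xy, n_x, n_y):
        # One cell of the 2x2 contingency table; empty cells contribute 0.
        if n_xy == 0:
            return 0.0
        return (n_xy / n) * math.log(n_xy * n / (n_x * n_y))

    scores = {}
    for (e, c), n_ec in joint_df.items():
        n_e, n_c = src_df[e], tgt_df[c]
        scores[(e, c)] = (
            term(n_ec, n_e, n_c)                             # both present
            + term(n_e - n_ec, n_e, n - n_c)                 # e present, c absent
            + term(n_c - n_ec, n - n_e, n_c)                 # e absent, c present
            + term(n - n_e - n_c + n_ec, n - n_e, n - n_c)   # both absent
        )
    return scores


# Toy aligned corpus: English tokens paired with romanized stand-ins
# for target-language tokens (hypothetical data).
toy_pairs = [
    (["election", "vote"], ["xuanju", "toupiao"]),
    (["election", "president"], ["xuanju", "zongtong"]),
    (["weather", "rain"], ["tianqi", "yu"]),
    (["weather", "storm"], ["tianqi", "fengbao"]),
]
scores = trigger_scores(toy_pairs)
best_pair = max(scores, key=scores.get)
```

No dictionary or translation system appears in the sketch: the only cross-lingual signal is the document alignment, in keeping with the abstract's remark that explicit translation capabilities are not required. How such scores would then enter a language model is left open here.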


