ABSTRACT
In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentences, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
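To make the idea of aligning on token counts alone more concrete, the following is a minimal sketch of a length-only aligner using dynamic programming. It is not the model described in the paper: the bead types, the penalty values in BEADS, the Gaussian-style cost on the log length ratio, and the sigma parameter are all illustrative assumptions, and anchor points are ignored entirely.

```python
import math

# Hypothetical bead types (source sentences, target sentences) with
# illustrative penalties. These numbers are assumptions, not values
# taken from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(src_len, tgt_len, sigma=0.4):
    """Cost that grows as the log ratio of token counts moves away from 0."""
    if src_len == 0 or tgt_len == 0:
        # Insertions and deletions are charged by the bead penalty alone.
        return 0.0
    r = math.log(tgt_len / src_len)
    return (r * r) / (2 * sigma * sigma)


def align(src_lens, tgt_lens):
    """Return the minimum-cost list of (source indices, target indices) beads.

    src_lens and tgt_lens are lists of per-sentence token counts.
    """
    n, m = len(src_lens), len(tgt_lens)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj])
                )
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)

    # Trace back from the end of both texts to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

For example, under these assumed penalties, align([10, 10, 20], [21, 19]) pairs the first two source sentences with the first target sentence and the third source sentence with the second target sentence, because the summed token counts match far better than any one-to-one pairing.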
CITED BY 77
Pierre Isabelle , Marc Dymetman , George Foster , Jean-Marc Jutras , Elliott Macklovitch , Francois Perrault , Xiaobo Ren , Michel Simard, Translation analysis and translation automation, Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: distributed computing, October 24-28, 1993, Toronto, Ontario, Canada
Raquel Martínez , Joseba Abaitua , Arantza Casillas, Bitext correspondences through rich mark-up, Proceedings of the 17th international conference on Computational linguistics, p.812-818, August 10-14, 1998, Montreal, Quebec, Canada
Takehito Utsuro , Hiroshi Ikeda , Masaya Yamane , Yuji Matsumoto , Makoto Nagao, Bilingual text matching using bilingual dictionary and statistics, Proceedings of the 15th conference on Computational linguistics, August 05-09, 1994, Kyoto, Japan
Ting Liu , Ming Zhou , Jianfeng Gao , Endong Xun , Changning Huang, PENS: a machine-aided English writing system for Chinese users, Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, p.529-536, October 03-06, 2000, Hong Kong
Peter F. Brown , Stephen A. Della Pietra , Vincent J. Della Pietra , Robert L. Mercer , Surya Mohanty, Dividing and conquering long sentences in a translation system, Proceedings of the workshop on Speech and Natural Language, February 23-26, 1992, Harriman, New York
Bing Zhao , Klaus Zechner , Stephan Vogel , Alex Waibel, Efficient optimization for bilingual sentence alignment based on linear regression, Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond, p.81-87, May 31, 2003, Edmonton, Canada
Lei Shi , Cheng Niu , Ming Zhou , Jianfeng Gao, A DOM tree alignment model for mining parallel data from the web, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, p.489-496, July 17-18, 2006, Sydney, Australia