ABSTRACT
A standard approach to cross-language information retrieval (CLIR) uses Latent Semantic Analysis (LSA) in conjunction with a multilingual parallel aligned corpus. This approach has been shown to be successful in identifying similar documents across languages or, more precisely, in retrieving the document in one language that is most similar to a query in another language. However, the approach has severe drawbacks when applied to a related task, that of clustering documents "language-independently", so that documents about similar topics end up closest to one another in the semantic space regardless of their language. The problem is that documents are generally more similar to other documents in the same language than they are to documents on the same topic in a different language. As a result, when using multilingual LSA, documents will in practice cluster by language, not by topic. We propose a novel application of PARAFAC2 (a variant of PARAFAC, which is a multi-way generalization of the singular value decomposition [SVD]) to overcome this problem. Instead of forming a single multilingual term-by-document matrix which, under LSA, is subjected to SVD, we form an irregular three-way array, each slice of which is a separate term-by-document matrix for a single language in the parallel corpus. The goal is to compute an SVD for each language such that V (the matrix of right singular vectors) is the same across all languages. Effectively, PARAFAC2 imposes the constraint, not present in standard LSA, that the "concepts" in all documents in the parallel corpus are the same regardless of language. Intuitively, this constraint makes sense, since the whole purpose of using a parallel corpus is that exactly the same concepts are expressed in the translations. We tested this approach by comparing the performance of PARAFAC2 with standard LSA in solving a particular CLIR problem.
From our results, we conclude that PARAFAC2 offers a very promising alternative to LSA not only for multilingual document clustering, but also for solving other problems in cross-language information retrieval.
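The decomposition described in the abstract can be written as X_k ≈ U_k S_k V^T, where X_k is the term-by-document matrix for language k, S_k is diagonal, and V is shared by every language. The following is a minimal NumPy sketch of one common way to fit such a model, a direct-fitting alternating least squares loop. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names `parafac2_als` and `make_parallel_slices`, the random initialization, the fixed iteration count, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np


def parafac2_als(slices, rank, n_iter=50, seed=0):
    """Fit X_k ~= U_k S_k V^T with a single V shared by all slices.

    slices: list of (terms_k x docs) arrays, one per language. The number
    of terms may differ per language; the number of documents may not.
    Returns (U_list, S, V, history), where row k of S is the diagonal of S_k.
    """
    rng = np.random.default_rng(seed)
    K, n_docs = len(slices), slices[0].shape[1]
    H = np.eye(rank)                         # U_k = Q_k @ H, Q_k orthonormal
    S = np.ones((K, rank))                   # per-language concept weights
    V = rng.standard_normal((n_docs, rank))  # shared document-concept matrix
    norm_sq = sum(np.linalg.norm(X) ** 2 for X in slices)
    history = []

    for _ in range(n_iter):
        # Step 1: orthogonal Procrustes update of each Q_k.
        Q = []
        for k, X in enumerate(slices):
            A, _, Bt = np.linalg.svd(X @ V @ np.diag(S[k]) @ H.T,
                                     full_matrices=False)
            Q.append(A @ Bt)
        # Step 2: project every slice onto its Q_k, giving a regular
        # (K x rank x docs) array even though term counts differ.
        Y = np.stack([Q[k].T @ slices[k] for k in range(K)])
        # Step 3: one round of ordinary PARAFAC least-squares updates on Y.
        H = np.einsum('krj,kf,jf->rf', Y, S, V) @ np.linalg.pinv((S.T @ S) * (V.T @ V))
        V = np.einsum('krj,kf,rf->jf', Y, S, H) @ np.linalg.pinv((S.T @ S) * (H.T @ H))
        S = np.einsum('krj,rf,jf->kf', Y, H, V) @ np.linalg.pinv((H.T @ H) * (V.T @ V))
        # Track the relative reconstruction error over all languages.
        err_sq = sum(
            np.linalg.norm(slices[k] - Q[k] @ H @ np.diag(S[k]) @ V.T) ** 2
            for k in range(K))
        history.append(float(np.sqrt(err_sq / norm_sq)))

    return [Qk @ H for Qk in Q], S, V, history


def make_parallel_slices(term_counts, n_docs, rank, seed=1):
    """Synthetic stand-in for a parallel corpus (hypothetical data)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_docs, rank))
    H = rng.standard_normal((rank, rank))
    slices = []
    for n_terms in term_counts:
        Qk, _ = np.linalg.qr(rng.standard_normal((n_terms, rank)))
        weights = rng.uniform(0.5, 2.0, size=rank)
        slices.append(Qk @ H @ np.diag(weights) @ V.T)
    return slices


# Four "languages" with different vocabulary sizes, twelve aligned documents.
slices = make_parallel_slices([30, 40, 25, 35], n_docs=12, rank=3)
U_list, S, V, history = parafac2_als(slices, rank=3)
print("relative error after fitting:", history[-1])
print("shared V shape:", V.shape)
```

In this sketch, each row of `V` is a language-independent concept vector for one aligned document, which is the property the abstract relies on for clustering by topic rather than by language. How a new, untranslated document would be projected into that space is not specified in the abstract and is left out here.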
INDEX TERMS
Primary Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.3 Information Search and Retrieval
Subjects: Clustering
Additional Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.3 Information Search and Retrieval
Subjects: Retrieval models
H.3.4 Systems and Software
Subjects: Performance evaluation (efficiency and effectiveness)
General Terms: Algorithms, Design, Experimentation, Languages, Measurement, Theory, Verification
Keywords: PARAFAC2, clustering, information retrieval, latent semantic analysis (LSA), multilingual