Cross-language information retrieval using PARAFAC2
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
San Jose, California, USA
SESSION: Research track papers
Pages: 143 - 152
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Peter A. Chew  Sandia National Laboratories
Brett W. Bader  Sandia National Laboratories
Tamara G. Kolda  Sandia National Laboratories
Ahmed Abdelali  New Mexico State University
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281192.1281211

ABSTRACT

A standard approach to cross-language information retrieval (CLIR) uses Latent Semantic Analysis (LSA) in conjunction with a multilingual parallel aligned corpus. This approach has been shown to be successful in identifying similar documents across languages or, more precisely, in retrieving the document in one language that is most similar to a query in another language. However, the approach has severe drawbacks when applied to a related task, that of clustering documents "language-independently", so that documents about similar topics end up closest to one another in the semantic space regardless of their language. The problem is that documents are generally more similar to other documents in the same language than they are to documents in a different language on the same topic. As a result, when using multilingual LSA, documents will in practice cluster by language, not by topic.
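
To make the multilingual LSA baseline concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization, raw term counts without weighting, and a tiny invented English/Spanish corpus; the helper names (`build_term_doc`, `fit_lsa`, `project`, `cosine`) are hypothetical. Each aligned document contributes one column holding the terms of all its translations, the stacked matrix is reduced with a truncated SVD, and a monolingual query is folded into the resulting space before comparison by cosine similarity.

```python
import numpy as np

def build_term_doc(parallel_docs):
    """Stack per-language term counts into one multilingual term-by-document matrix.

    parallel_docs: list of dicts {language: text}, one dict per aligned document.
    Terms are tagged with their language so identical strings stay distinct.
    """
    vocab = sorted({(lang, w) for doc in parallel_docs
                    for lang, text in doc.items() for w in text.lower().split()})
    index = {term: i for i, term in enumerate(vocab)}
    X = np.zeros((len(vocab), len(parallel_docs)))
    for j, doc in enumerate(parallel_docs):
        for lang, text in doc.items():
            for w in text.lower().split():
                X[index[(lang, w)], j] += 1.0
    return X, index

def fit_lsa(X, rank):
    """Truncated SVD X ~= U S V^T; returns the term factors U and singular values s."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank], s[:rank]

def project(text, lang, index, U, s):
    """Fold a monolingual text into the shared space as q^T U S^{-1}."""
    q = np.zeros(U.shape[0])
    for w in text.lower().split():
        i = index.get((lang, w))
        if i is not None:              # terms unseen in training are ignored
            q[i] += 1.0
    return (q @ U) / s

def cosine(a, b):
    """Cosine similarity with a small constant guarding against zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Invented toy parallel corpus (illustrative only).
docs = [
    {"en": "the cat sleeps on the mat", "es": "el gato duerme en la alfombra"},
    {"en": "the dog chases the cat", "es": "el perro persigue al gato"},
    {"en": "stocks fall on the market", "es": "las acciones caen en el mercado"},
    {"en": "the market rises with stocks", "es": "el mercado sube con las acciones"},
]

X, index = build_term_doc(docs)
U, s = fit_lsa(X, rank=2)

# Retrieve the Spanish document closest to an English query.
query = project("the dog sleeps", "en", index, U, s)
scores = [cosine(query, project(d["es"], "es", index, U, s)) for d in docs]
best = int(np.argmax(scores))
print(best, [round(sc, 3) for sc in scores])
```

Because every translation of a document shares a single column, nothing in this construction forces documents from different languages to lie close together when they are later projected one language at a time, which is the clustering weakness described above.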

We propose a novel application of PARAFAC2 (which is a variant of PARAFAC, a multi-way generalization of the singular value decomposition [SVD]) to overcome this problem. Instead of forming a single multilingual term-by-document matrix which, under LSA, is subjected to SVD, we form an irregular three-way array, each slice of which is a separate term-by-document matrix for a single language in the parallel corpus. The goal is to compute an SVD for each language such that V (the matrix of right singular vectors) is the same across all languages. Effectively, PARAFAC2 imposes the constraint, not present in standard LSA, that the "concepts" in all documents in the parallel corpus are the same regardless of language. Intuitively, this constraint makes sense, since the whole purpose of using a parallel corpus is that exactly the same concepts are expressed in the translations.
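
The decomposition described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: it follows the general shape of a direct-fitting alternating least squares scheme for PARAFAC2 (in the spirit of Kiers, Ten Berge and Bro, 1999), runs on synthetic data rather than a real parallel corpus, and uses a hypothetical function name `parafac2_als`. Each language slice X_k (terms by documents) is approximated as U_k S_k V^T, where V is shared by all languages, S_k is diagonal, and U_k = P_k F with P_k having orthonormal columns, so that U_k^T U_k is identical for every language.

```python
import numpy as np

def parafac2_als(slices, rank, n_iter=50):
    """Fit X_k ~= U_k diag(S[k]) V^T with V shared across all slices.

    slices: list of (n_terms_k x n_docs) arrays; n_terms_k may differ per slice
    (each is assumed to have at least `rank` rows).
    Returns the per-slice U_k, the K x rank array S, the shared V, and the
    relative reconstruction error recorded after every iteration.
    """
    K = len(slices)
    total = sum(np.linalg.norm(X) ** 2 for X in slices)
    # Initialise V from the leading eigenvectors of the pooled Gram matrix.
    _, eigvecs = np.linalg.eigh(sum(X.T @ X for X in slices))
    V = eigvecs[:, -rank:]
    F = np.eye(rank)
    S = np.ones((K, rank))
    errors = []
    for _ in range(n_iter):
        # Step 1: orthonormal P_k via an SVD (orthogonal Procrustes problem).
        P = []
        for k, X in enumerate(slices):
            A, _, Bt = np.linalg.svd((F * S[k]) @ V.T @ X.T, full_matrices=False)
            P.append(Bt.T @ A.T)
        # Step 2: project each slice onto its P_k, giving a regular array Y.
        Y = [P[k].T @ X for k, X in enumerate(slices)]
        # Step 3: one least-squares sweep of ordinary PARAFAC on Y.
        F = sum(Y[k] @ (V * S[k]) for k in range(K)) @ np.linalg.pinv((V.T @ V) * (S.T @ S))
        V = sum(Y[k].T @ (F * S[k]) for k in range(K)) @ np.linalg.pinv((F.T @ F) * (S.T @ S))
        G = np.linalg.pinv((F.T @ F) * (V.T @ V))
        S = np.array([G @ np.diag(F.T @ Y[k] @ V) for k in range(K)])
        resid = sum(np.linalg.norm(X - ((P[k] @ F) * S[k]) @ V.T) ** 2
                    for k, X in enumerate(slices))
        errors.append(float(np.sqrt(resid / total)))
    return [P[k] @ F for k in range(K)], S, V, errors

# Synthetic stand-in for a parallel corpus: three "languages" with different
# vocabulary sizes describing the same 20 documents (illustrative data only).
rng = np.random.default_rng(0)
rank, n_docs = 3, 20
V_true = rng.standard_normal((n_docs, rank))
H = rng.standard_normal((rank, rank))
slices = []
for n_terms in (30, 40, 25):
    Q, _ = np.linalg.qr(rng.standard_normal((n_terms, rank)))
    scale = rng.uniform(0.5, 2.0, size=rank)
    slices.append(((Q @ H) * scale) @ V_true.T)

U_list, S, V, errors = parafac2_als(slices, rank)
print("relative error, first and last iteration:", errors[0], errors[-1])
```

The point of the design is that the slices never need a common vocabulary: only the document mode is shared, which is what allows an irregular three-way array to be factored at all. A real application would use sparse matrices and a convergence test rather than a fixed iteration count.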

We tested this approach by comparing the performance of PARAFAC2 with standard LSA in solving a particular CLIR problem. From our results, we conclude that PARAFAC2 offers a very promising alternative to LSA not only for multilingual document clustering, but also for solving other problems in cross-language information retrieval.



