ACM Home Page
Please provide us with feedback. Feedback
GaP: a factor model for discrete data
Full text PdfPdf (172 KB)
Source Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval table of contents
Sheffield, United Kingdom
SESSION: Formal models-2 table of contents
Pages: 122 - 129  
Year of Publication: 2004
ISBN:1-58113-881-4
Author
John Canny  University of California, Berkeley, CA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 72,   Citation Count: 10
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1008992.1009016
What is a DOI?

ABSTRACT

We present a probabilistic model for a document corpus that combines many of the desirable features of previous models. The model is called "GaP" for Gamma-Poisson, the distributions of the first and last random variable. GaP is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices Λ and X. These factors have strictly non-negative terms. GaP is a generative probabilistic model that assigns finite probabilities to documents in a corpus. It can be computed with an efficient and simple EM recurrence. For a suitable choice of parameters, the GaP factorization maximizes independence between the factors. So it can be used as an independent-component algorithm adapted to document data. The form of the GaP model is empirically as well as analytically motivated. It gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. The GaP model projects documents and terms into a low-dimensional space of "themes," and models texts as "passages" of terms on the same theme.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002.
 
2
3
 
4
A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks, 10(3):626--634, 1999.
 
5
 
6
T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104:211--240, 1997.
 
7
T. K. Landauer, P. W. Foltz, and D. Laham. Introduction to Latent Semantic Analysis. Discourse Processes, 25:259--284, 1998.
 
8
 
9
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556--562, 2000.
10
11

CITED BY  10