|
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
ABSTRACT
We present a probabilistic model for a document corpus that combines many of the desirable features of previous models. The model is called "GaP" for Gamma-Poisson, the distributions of the first and last random variable. GaP is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices Λ and X. These factors have strictly non-negative terms. GaP is a generative probabilistic model that assigns finite probabilities to documents in a corpus. It can be computed with an efficient and simple EM recurrence. For a suitable choice of parameters, the GaP factorization maximizes independence between the factors. So it can be used as an independent-component algorithm adapted to document data. The form of the GaP model is empirically as well as analytically motivated. It gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. The GaP model projects documents and terms into a low-dimensional space of "themes," and models texts as "passages" of terms on the same theme. REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
INDEX TERMS
Primary Classification:
Additional Classification:
General Terms:
Keywords:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||