Modeling hidden topics on document manifold
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: IR: medley
Pages 911-920
Year of Publication: 2008
ISBN: 978-1-59593-991-3
Authors
Deng Cai, University of Illinois at Urbana Champaign, Urbana, USA
Qiaozhu Mei, University of Illinois at Urbana Champaign, Urbana, USA
Jiawei Han, University of Illinois at Urbana Champaign, Urbana, USA
Chengxiang Zhai, University of Illinois at Urbana Champaign, Urbana, USA
Bibliometrics
Downloads (6 Weeks): 19, Downloads (12 Months): 210, Citation Count: 1
ABSTRACT
Topic modeling has been a key problem for document analysis. One of the canonical approaches to topic modeling is Probabilistic Latent Semantic Indexing (PLSI), which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document over the hidden topics independently, and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious overfitting problems. Latent Dirichlet Allocation (LDA) was proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both methods discover the hidden topics in Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. It is therefore more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on the intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and performs topic modeling directly on this document manifold. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides a better representation in the sense of semantic structure.
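As a rough illustration of the PLSI baseline described in the abstract, the sketch below fits P(z|d) and P(w|z) with EM in plain Python. The function names (`plsi_em`, `log_likelihood`, `manifold_penalty`), the toy count matrix, and the neighbor weights are all assumptions made for illustration. In particular, `manifold_penalty` is only a generic graph-smoothness term of the kind a Laplacian regularizer might use; it is not taken from the paper and should not be read as the LapPLSI objective.

```python
import math
import random


def _normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsi_em(counts, num_topics, iters=30, seed=0):
    """Fit PLSI with EM on a dense document-term count matrix (list of lists)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # One P(z|d) row per document: this is why the parameter count
    # grows linearly with the number of documents.
    p_z_d = [_normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(num_topics)]
    for _ in range(iters):
        new_zd = [[0.0] * num_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(num_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(w|z) P(z|d).
                post = _normalize([p_w_z[z][w] * p_z_d[d][z]
                                   for z in range(num_topics)])
                for z in range(num_topics):
                    new_zd[d][z] += n * post[z]
                    new_wz[z][w] += n * post[z]
        # M-step: renormalize the expected counts.
        p_z_d = [_normalize(row) for row in new_zd]
        p_w_z = [_normalize(row) for row in new_wz]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum of n(d,w) * log P(w|d); the P(d) term is constant and omitted."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
                total += n * math.log(p)
    return total


def manifold_penalty(p_z_d, weights):
    """Assumed smoothness term: 0.5 * sum_ij W_ij * ||P(.|d_i) - P(.|d_j)||^2."""
    total = 0.0
    for i, row_i in enumerate(p_z_d):
        for j, row_j in enumerate(p_z_d):
            if weights[i][j]:
                total += weights[i][j] * sum((a - b) ** 2
                                             for a, b in zip(row_i, row_j))
    return 0.5 * total


# Toy data (hypothetical): two documents per vocabulary block.
counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 5, 2], [0, 0, 3, 4]]
# Hypothetical nearest-neighbor graph linking documents 0-1 and 2-3.
weights = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

p_z_d, p_w_z = plsi_em(counts, num_topics=2)
print("log-likelihood:", log_likelihood(counts, p_z_d, p_w_z))
print("smoothness penalty:", manifold_penalty(p_z_d, weights))
```

A manifold-regularized variant would, under these assumptions, trade the log-likelihood off against a penalty of this kind, so that documents that are neighbors in the graph receive similar topic distributions.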
CITED BY
Deng Cai, Xuanhui Wang, Xiaofei He, Probabilistic dyadic data analysis with local and global consistency, Proceedings of the 26th Annual International Conference on Machine Learning, p.105-112, June 14-18, 2009, Montreal, Quebec, Canada