Modeling hidden topics on document manifold
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: IR: medley
Pages 911-920  
Year of Publication: 2008
ISBN:978-1-59593-991-3
Authors
Deng Cai  University of Illinois at Urbana Champaign, Urbana, USA
Qiaozhu Mei  University of Illinois at Urbana Champaign, Urbana, USA
Jiawei Han  University of Illinois at Urbana Champaign, Urbana, USA
Chengxiang Zhai  University of Illinois at Urbana Champaign, Urbana, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458202

ABSTRACT

Topic modeling is a key problem in document analysis. One canonical approach is Probabilistic Latent Semantic Indexing (PLSI), which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates each document's probability distribution over the hidden topics independently, so the number of model parameters grows linearly with the size of the corpus, which leads to serious overfitting. Latent Dirichlet Allocation (LDA) was proposed to overcome this problem by treating each document's distribution over topics as a hidden random variable. Both methods discover the hidden topics in Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. It is therefore more natural to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on the intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI). LapPLSI models the document space as a submanifold embedded in the ambient space and performs topic modeling directly on this manifold. We compare LapPLSI with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides a better representation in the sense of semantic structure.
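
To make the abstract's description of PLSI concrete, here is a minimal sketch of the standard PLSI EM updates in plain Python, together with a count of the probability-table entries, which grows with the number of documents. The abstract does not state the LapPLSI objective, so the `manifold_penalty` function is only an assumption: a generic graph-smoothness term over a hypothetical nearest-neighbor weight matrix, shown to illustrate what "modeling on a document manifold" could measure. All function names, the toy `COUNTS` matrix, and the `WEIGHTS` graph are illustrative and not taken from the paper.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def plsi_em(counts, n_topics, n_iters=50, seed=0):
    """Fit PLSI by EM on a document-term count matrix (list of lists).

    Assumes every document contains at least one term.
    Returns (p_w_z, p_z_d) with p_w_z[k][w] = P(w | z_k) and
    p_z_d[d][k] = P(z_k | d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iters):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z_k | d, w) for this (document, term) pair.
                post = normalize([p_w_z[k][w] * p_z_d[d][k]
                                  for k in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for k in range(n_topics):
                    new_w_z[k][w] += n * post[k]
                    new_z_d[d][k] += n * post[k]
        # M-step: renormalize expected counts into probabilities.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum over (d, w) of n(d, w) * log sum_k P(w | z_k) P(z_k | d)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(p_w_z[k][w] * p_z_d[d][k] for k in range(len(p_w_z)))
                ll += n * math.log(p)
    return ll


def plsi_parameter_count(n_docs, n_words, n_topics):
    """Probability-table entries in PLSI; linear in the number of documents."""
    return n_topics * n_words + n_docs * n_topics


def manifold_penalty(p_z_d, weights):
    """ASSUMED smoothness term: 0.5 * sum_ij W_ij * ||P(z|d_i) - P(z|d_j)||^2."""
    total = 0.0
    for i, row in enumerate(weights):
        for j, w_ij in enumerate(row):
            if w_ij:
                total += w_ij * sum((a - b) ** 2
                                    for a, b in zip(p_z_d[i], p_z_d[j]))
    return 0.5 * total


# Toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
COUNTS = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 4]]
# Hypothetical neighbor graph linking the documents that share vocabulary.
WEIGHTS = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

if __name__ == "__main__":
    p_w_z, p_z_d = plsi_em(COUNTS, n_topics=2)
    print("log-likelihood:", log_likelihood(COUNTS, p_w_z, p_z_d))
    print("table entries:", plsi_parameter_count(len(COUNTS), len(COUNTS[0]), 2))
    print("smoothness penalty:", manifold_penalty(p_z_d, WEIGHTS))
```

In a sketch like this, a manifold-regularized variant would trade the log-likelihood off against the penalty so that neighboring documents receive similar topic distributions; the exact weighting and optimization used by LapPLSI are given in the full paper, not in this abstract.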

