Efficient methods for topic model inference on streaming document collections
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Paris, France
SESSION: Research track papers
Pages 937-946
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Limin Yao  University of Massachusetts, Amherst, Amherst, MA, USA
David Mimno  University of Massachusetts, Amherst, Amherst, MA, USA
Andrew McCallum  University of Massachusetts, Amherst, Amherst, MA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1557019.1557121

ABSTRACT

Topic models provide a powerful tool for analyzing large text collections by representing high dimensional data in a low dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
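
The abstract's point that classification-style inference needs only a single matrix multiplication can be illustrated with a minimal sketch. This is not the authors' implementation: the weight matrix `weights` (one row of per-word weights for each topic), the length normalization, and the softmax step are all assumptions made for illustration. The abstract states only that one matrix multiplication is required.

```python
import math

def infer_topics(weights, counts):
    """Estimate a topic distribution for one unseen document.

    weights: K x V list of lists, hypothetical pre-trained per-topic word
             weights (an assumption; the abstract does not specify them).
    counts:  length-V list of word counts for the new document.
    """
    num_topics = len(weights)
    total = sum(counts)
    if total == 0:
        # An empty document carries no evidence: fall back to uniform.
        return [1.0 / num_topics] * num_topics

    # The single matrix-vector product: one score per topic.
    scores = [sum(w * c for w, c in zip(row, counts)) / total
              for row in weights]

    # Softmax normalization (assumed), shifted by the max for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Toy example: 2 topics, 2 vocabulary words.
theta = infer_topics([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

Unlike Gibbs sampling or variational inference, nothing here iterates: the cost is one pass over the document's words per topic, which is why such a method suits streaming collections.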


