Accounting for burstiness in topic models
Source: ACM International Conference Proceeding Series; Vol. 382
Proceedings of the 26th Annual International Conference on Machine Learning
Montreal, Quebec, Canada
Pages 281-288  
Year of Publication: 2009
ISBN: 978-1-60558-516-1
Authors
Gabriel Doyle  University of California, San Diego, La Jolla, CA
Charles Elkan  University of California, San Diego, La Jolla, CA
Sponsors
MITACS
NSF
Microsoft Research
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1553374.1553410

ABSTRACT

Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA). It is straightforward to incorporate the DCM extension into topic models that are more complex than LDA.
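
The burstiness effect described in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' topic model; it only compares a plain multinomial with a Dirichlet compound multinomial (DCM) on two hypothetical four-word count vectors, one bursty and one evenly spread. The function names, the uniform word probabilities, and the choice of alpha = 0.1 per word are assumptions made for illustration.

```python
import math


def multinomial_log_pmf(counts, probs):
    """Log-probability of a count vector under a multinomial (assumed helper)."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)  # log n!
    for x, p in zip(counts, probs):
        log_p -= math.lgamma(x + 1)  # minus log x_w!
        if x > 0:
            log_p += x * math.log(p)
    return log_p


def dcm_log_pmf(counts, alphas):
    """Log-probability of a count vector under a DCM (assumed helper).

    Uses the standard closed form:
    n!/prod(x_w!) * Gamma(A)/Gamma(A+n) * prod Gamma(x_w+alpha_w)/Gamma(alpha_w)
    with A = sum(alphas) and n = sum(counts).
    """
    n = sum(counts)
    total_alpha = sum(alphas)
    log_p = math.lgamma(n + 1)
    log_p += math.lgamma(total_alpha) - math.lgamma(total_alpha + n)
    for x, a in zip(counts, alphas):
        log_p -= math.lgamma(x + 1)
        log_p += math.lgamma(x + a) - math.lgamma(a)
    return log_p


# Hypothetical documents of four tokens over a four-word vocabulary.
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # every word used once

uniform_probs = [0.25] * 4   # illustrative multinomial parameters
small_alphas = [0.1] * 4     # small alphas make the DCM favor repeats

for name, doc in (("bursty", bursty), ("spread", spread)):
    mult = math.exp(multinomial_log_pmf(doc, uniform_probs))
    dcm = math.exp(dcm_log_pmf(doc, small_alphas))
    print(f"{name}: multinomial={mult:.6f}  DCM={dcm:.6f}")
```

Under these assumed parameters the multinomial assigns more probability to the evenly spread document, while the DCM assigns more to the bursty one, which is the behavior the abstract attributes to DCM distributions.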

