Topic modeling: beyond bag-of-words
Source: ACM International Conference Proceeding Series; Vol. 148
Proceedings of the 23rd International Conference on Machine Learning
Pittsburgh, Pennsylvania
Pages: 977 - 984  
Year of Publication: 2006
ISBN:1-59593-383-2
Author
Hanna M. Wallach  University of Cambridge, Cambridge, UK
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1143844.1143967

ABSTRACT

Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
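
The abstract does not spell out the generative process, so the following is only a minimal sketch under stated assumptions: each document has a topic mixture `theta`, each token draws a topic from `theta`, and the word is then drawn from a distribution indexed by both that topic and the previous word. The symmetric Dirichlet priors `alpha` and `beta`, the start-of-document index `0`, and all function names are hypothetical choices made for illustration. Hyperparameter inference with Gibbs EM is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet with parameter list alpha."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index in range(len(probs)) with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(length, num_topics, vocab_size, alpha, beta, rng, phi=None):
    """Generate word and topic indices from an assumed bigram topic model.

    phi maps (topic, previous_word) to a distribution over the next word.
    Passing the same dict for several documents shares those distributions
    across the corpus; entries are drawn lazily on first use.
    """
    if phi is None:
        phi = {}
    # Per-document topic mixture (assumption: symmetric Dirichlet prior).
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words, topics = [], []
    prev = 0  # hypothetical start-of-document token index
    for _ in range(length):
        z = sample_index(theta, rng)  # the topic ignores word order
        key = (z, prev)
        if key not in phi:
            phi[key] = sample_dirichlet([beta] * vocab_size, rng)
        w = sample_index(phi[key], rng)  # the word depends on topic and previous word
        words.append(w)
        topics.append(z)
        prev = w
    return words, topics


if __name__ == "__main__":
    rng = random.Random(0)
    shared_phi = {}
    for _ in range(2):
        doc_words, doc_topics = generate_document(8, 3, 10, 1.0, 0.5, rng, shared_phi)
        print(doc_words, doc_topics)
```

Dropping the dependence on `prev` in the sketch gives a unigram topic model, and dropping the dependence on `z` gives a bigram language model, which are the two baselines the abstract compares against.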


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1. Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5--43.
2.
3. Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228--5235.
4. Griffiths, T. L., & Steyvers, M. (2005). Topic modeling toolbox. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm.
5. Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (2004). Integrating topics and syntax. Advances in Neural Information Processing Systems.
6. Jelinek, F., & Mercer, R. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. Gelsema and L. Kanal (Eds.), Pattern recognition in practice, 381--402. North-Holland publishing company.
7. Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773--795.
8. MacKay, D. J. C., & Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1, 289--307.
9. Minka, T. P. (2003). Estimating a Dirichlet distribution. http://research.microsoft.com/~minka/papers/dirichlet/.
10. Rennie, J. (2005). 20 newsgroups data set. http://people.csail.mit.edu/jrennie/20Newsgroups/.