ABSTRACT
Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
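The combination described above can be illustrated with a small generative sketch: each document draws a topic mixture, and each word is drawn from a distribution that depends on both the current topic and the previous word. This is only a toy illustration under stated assumptions, not the model as specified in the paper. The function names, the symmetric scalar hyperparameters `alpha` and `beta`, and the fixed `start_word` are all hypothetical choices made for the example. The hierarchical prior structure and the Gibbs EM hyperparameter inference mentioned in the abstract are not shown.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index according to a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, length, num_topics, vocab_size,
                    alpha, beta, rng, start_word=0):
    """Generate (words, topics) pairs from a toy bigram topic model.

    Assumptions (not from the source): symmetric Dirichlet priors with
    scalar alpha and beta, and a fixed start_word used as the context
    for the first token of every document.
    """
    # One word distribution per (topic, previous word) pair, shared by
    # all documents and sampled lazily the first time a pair is needed.
    phi = {}
    corpus = []
    for _ in range(num_docs):
        # Per-document mixture over topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words, topics = [], []
        prev = start_word
        for _ in range(length):
            z = sample_index(theta, rng)      # latent topic for this position
            key = (z, prev)
            if key not in phi:
                phi[key] = sample_dirichlet([beta] * vocab_size, rng)
            w = sample_index(phi[key], rng)   # word given topic and previous word
            words.append(w)
            topics.append(z)
            prev = w
        corpus.append((words, topics))
    return corpus
```

Conditioning only on the topic (dropping `prev` from the key) would give a unigram topic model, and conditioning only on `prev` would give a plain bigram model, which is one way to see how the two approaches are being combined.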