ABSTRACT
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 1960s and has recently produced good results in the language modeling framework. Latent Dirichlet Allocation (LDA), an approach to building topic models based on a formal generative model of documents, is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval are mostly unknown. In this paper, we study how to use LDA efficiently to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA, and its computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
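To make the idea of an LDA-based document model inside the language modeling framework more concrete, the following is a minimal sketch of query-likelihood scoring. It is an illustration only: the abstract does not give the scoring formula, so the linear interpolation between a Dirichlet-smoothed document model and an LDA word probability, the function names, and the parameters `lam` and `mu` are all assumptions. The topic distributions `doc_topic` and `topic_word` are taken as already estimated, for instance by Gibbs sampling.

```python
import math
from collections import Counter


def lda_word_prob(word, doc_topic, topic_word):
    """P_lda(w|D) = sum over topics z of P(w|z) * P(z|D)."""
    return sum(p_z * topic_word[z].get(word, 0.0)
               for z, p_z in enumerate(doc_topic))


def dirichlet_word_prob(word, doc_counts, doc_len, coll_prob, mu):
    """Dirichlet-smoothed unigram document model.

    Assumes every query word has an entry in coll_prob.
    """
    return (doc_counts.get(word, 0) + mu * coll_prob[word]) / (doc_len + mu)


def query_log_likelihood(query, doc_tokens, doc_topic, topic_word,
                         coll_prob, lam=0.7, mu=1000.0):
    """Log-likelihood of the query under the interpolated document model.

    lam and mu are hypothetical defaults, not values from the paper.
    """
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for w in query:
        p_lm = dirichlet_word_prob(w, counts, doc_len, coll_prob, mu)
        p_lda = lda_word_prob(w, doc_topic, topic_word)
        score += math.log(lam * p_lm + (1.0 - lam) * p_lda)
    return score


# Toy data, invented purely for illustration.
coll_prob = {"apple": 0.5, "pear": 0.5}
topic_word = [{"apple": 0.9, "pear": 0.1}, {"apple": 0.1, "pear": 0.9}]
doc_tokens = ["apple", "apple"]
doc_topic = [1.0, 0.0]

apple_score = query_log_likelihood(["apple"], doc_tokens, doc_topic,
                                   topic_word, coll_prob, lam=0.5, mu=2.0)
pear_score = query_log_likelihood(["pear"], doc_tokens, doc_topic,
                                  topic_word, coll_prob, lam=0.5, mu=2.0)
```

In this toy setting the document is about "apple" in both its word counts and its topic mixture, so the query "apple" scores higher than "pear". Documents in a collection would be ranked by this score for each query.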