ABSTRACT
We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented by a probability distribution over words. The words in a multi-author paper are assumed to be generated from a mixture of the authors' topic distributions. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. We also discuss an online query interface to the model that allows interactive exploration of author-topic models for corpora such as CiteSeer.
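The two-stage generative process described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `generate_document`, the toy authors, topics, and probabilities are all hypothetical, and the code assumes that each word selects one of the document's authors uniformly at random before a topic is drawn, which the abstract does not state explicitly. It covers only the forward sampling of words, not the Markov chain Monte Carlo learning step.

```python
import random


def generate_document(authors, author_topic, topic_word, n_words, rng):
    """Sample (author, topic, word) triples for one document.

    Illustrative only: all names and distributions are hypothetical.
    """
    samples = []
    for _ in range(n_words):
        # Assumption: each word picks one of the document's authors uniformly.
        author = rng.choice(authors)
        # Stage 1: draw a topic from that author's distribution over topics.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Stage 2: draw a word from that topic's distribution over words.
        vocab, word_probs = zip(*topic_word[topic].items())
        word = rng.choices(vocab, weights=word_probs, k=1)[0]
        samples.append((author, topic, word))
    return samples


# Toy, made-up distributions (not taken from the CiteSeer model).
author_topic = {
    "author_a": {"retrieval": 0.9, "inference": 0.1},
    "author_b": {"retrieval": 0.2, "inference": 0.8},
}
topic_word = {
    "retrieval": {"query": 0.5, "document": 0.4, "index": 0.1},
    "inference": {"sampler": 0.5, "posterior": 0.3, "prior": 0.2},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document(["author_a", "author_b"], author_topic, topic_word, 8, rng)
for author, topic, word in doc:
    print(author, topic, word)
```

In a two-author document generated this way, each word is attributable to one author and one topic, which is the kind of assignment that the parsing of abstracts by topics and authors relies on.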