Probabilistic author-topic models for information discovery
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Seattle, WA, USA
SESSION: Research track papers
Pages: 306-315
Year of Publication: 2004
ISBN:1-58113-888-1
Authors
Mark Steyvers  University of California - Irvine, Irvine, CA
Padhraic Smyth  University of California - Irvine, Irvine, CA
Michal Rosen-Zvi  University of California - Irvine, Irvine, CA
Thomas Griffiths  Stanford University, Stanford, CA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 22,   Downloads (12 Months): 184,   Citation Count: 34
DOI: http://doi.acm.org/10.1145/1014052.1014087

ABSTRACT

We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of the authors' topic mixtures. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. We also discuss an online query interface to the model that allows interactive exploration of author-topic models for corpora such as CiteSeer.
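
The two-stage generative process described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy distributions, the names `AUTHOR_TOPIC`, `TOPIC_WORD`, `sample_document`, and `word_probability` are hypothetical, and the sketch assumes that each word in a multi-author paper is attributed to one co-author chosen uniformly at random. The abstract only states that the words result from a mixture of the authors' topic mixtures, so the uniform weighting is an assumption. Learning the distributions with Markov chain Monte Carlo is not shown.

```python
import random

# Hypothetical toy distributions (not taken from the paper).
# Each author is a probability distribution over topics.
AUTHOR_TOPIC = {
    "alice": {"ml": 0.9, "db": 0.1},
    "bob": {"ml": 0.2, "db": 0.8},
}
# Each topic is a probability distribution over words.
TOPIC_WORD = {
    "ml": {"bayes": 0.5, "kernel": 0.5},
    "db": {"query": 0.7, "index": 0.3},
}


def sample_document(authors, author_topic, topic_word, n_words, rng):
    """Draw n_words (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        # Assumption: the word is attributed to a uniformly chosen co-author.
        author = rng.choice(authors)
        # Stage 1: draw a topic from that author's topic distribution.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Stage 2: draw a word from that topic's word distribution.
        words, word_probs = zip(*topic_word[topic].items())
        word = rng.choices(words, weights=word_probs, k=1)[0]
        tokens.append((author, topic, word))
    return tokens


def word_probability(word, authors, author_topic, topic_word):
    """Probability of a word under the uniform mixture of the authors' topic mixtures."""
    total = 0.0
    for author in authors:
        for topic, p_topic in author_topic[author].items():
            total += p_topic * topic_word[topic].get(word, 0.0)
    return total / len(authors)


if __name__ == "__main__":
    rng = random.Random(0)
    for triple in sample_document(["alice", "bob"], AUTHOR_TOPIC, TOPIC_WORD, 5, rng):
        print(triple)
    print(word_probability("query", ["alice", "bob"], AUTHOR_TOPIC, TOPIC_WORD))
```

With these toy numbers, the probability of "query" in a paper by both authors is the average of the two authors' contributions: 0.5 × (0.1 × 0.7 + 0.8 × 0.7) = 0.315.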


