ABSTRACT
In this paper we address the problem of detecting topics in large-scale linked document collections. Topic detection has recently become a very active area of research because of its utility for information navigation, trend analysis, and high-level description of data. We present a unique approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph whose nodes are limited to the documents containing the term. This tight coupling between term and graph analysis distinguishes our approach from others, such as those that focus on language models. We develop a topic score measure for each term, using the likelihood ratio of binary hypotheses based on a probabilistic description of graph connectivity. Our approach is based on the intuition that if a term is relevant to a topic, the documents containing the term have denser connectivity than a random selection of documents. We extend our algorithm to detect a topic represented by a set of terms, using the intuition that if the co-occurrence of terms represents a new topic, the citation pattern should exhibit a synergistic effect. We test our algorithm on two electronic research literature collections, arXiv and Citeseer. Our evaluation shows that the approach is effective and reveals some novel aspects of topic detection.
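The intuition behind the topic score can be illustrated with a small sketch. It is not the measure defined in the paper: it assumes a simple model in which every directed citation link is an independent Bernoulli trial, and it compares the hypothesis that the documents containing a term link at their own observed density against the hypothesis that they link at the background density of the whole collection. The function names, the input layout (`doc_terms`, `citations`), and the toy data are all hypothetical.

```python
import math


def bernoulli_log_likelihood(edges, pairs, p):
    """Log-likelihood of seeing `edges` links among `pairs` possible links
    when each link is present independently with probability p."""
    if p <= 0.0:
        return 0.0 if edges == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if edges == pairs else -math.inf
    return edges * math.log(p) + (pairs - edges) * math.log(1.0 - p)


def topic_score(term, doc_terms, citations):
    """Illustrative log-likelihood ratio for one term (an assumption, not
    the paper's formula).

    doc_terms: dict mapping document id -> set of terms in that document.
    citations: set of (citing_id, cited_id) pairs without self-links.
    """
    term_docs = {d for d, terms in doc_terms.items() if term in terms}
    n_term, n_all = len(term_docs), len(doc_terms)
    if n_term < 2:
        return 0.0  # no possible link inside the term's subgraph

    # Possible directed links in the whole graph and in the term's subgraph.
    pairs_all = n_all * (n_all - 1)
    pairs_term = n_term * (n_term - 1)

    # Links whose two endpoints both contain the term.
    edges_term = sum(
        1 for src, dst in citations if src in term_docs and dst in term_docs
    )

    p_background = len(citations) / pairs_all
    p_term = edges_term / pairs_term
    if p_term <= p_background:
        return 0.0  # one-sided test: only denser-than-random counts

    return (
        bernoulli_log_likelihood(edges_term, pairs_term, p_term)
        - bernoulli_log_likelihood(edges_term, pairs_term, p_background)
    )


# Toy collection: "topic" appears only in three tightly linked documents,
# while "graph" appears everywhere.
doc_terms = {
    1: {"graph", "topic"},
    2: {"graph", "topic"},
    3: {"graph", "topic"},
    4: {"graph"},
    5: {"graph"},
    6: {"graph"},
}
citations = {(1, 2), (2, 3), (1, 3), (4, 5)}

focused = topic_score("topic", doc_terms, citations)
ubiquitous = topic_score("graph", doc_terms, citations)
```

In this toy collection the term that is confined to a densely linked group of documents receives a positive score, while the term that occurs in every document scores zero, because its subgraph is the whole citation graph and therefore no denser than the background.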