| Scalable community discovery on textual data with relations |
| Full text |
Pdf
(442 KB)
|
Source
|
Conference on Information and Knowledge Management
archive
Proceeding of the 17th ACM conference on Information and knowledge management
table of contents
Napa Valley, California, USA
SESSION: KM: text mining
table of contents
Pages 1203-1212
Year of Publication: 2008
ISBN:978-1-59593-991-3
|
|
Authors
|
|
Huajing Li
|
The Pennsylvania State University, University Park, PA, USA
|
|
Zaiqing Nie
|
Microsoft Research Asia, Beijing, China
|
|
Wang-Chien Lee
|
The Pennsylvania State University, University Park, PA, USA
|
|
Lee Giles
|
The Pennsylvania State University, University Park, PA, USA
|
|
Ji-Rong Wen
|
Microsoft Research Asia, Beijing, China
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 6, Downloads (12 Months): 189, Citation Count: 0
|
|
|
ABSTRACT
Every piece of textual data is generated as a method to convey its authors' opinion regarding specific topics. Authors deliberately organize their writings and create links, i.e., references, acknowledgments, for better expression. Thereafter, it is of interest to study texts as well as their relations to understand the underlying topics and communities. Although many efforts exist in the literature in data clustering and topic mining, they are not applicable to community discovery on large document corpus for several reasons. First, few of them consider both textual attributes as well as relations. Second, scalability remains a significant issue for large-scale datasets. Additionally, most algorithms rely on a set of initial parameters that are hard to be captured and tuned. Motivated by the aforementioned observations, a hierarchical community model is proposed in the paper which distinguishes community cores from affiliated members. We present our efforts to develop a scalable community discovery solution for large-scale document corpus. Our proposal tries to quickly identify potential cores as seeds of communities through relation analysis. To eliminate the influence of initial parameters, an innovative attribute-based core merge process is introduced so that the algorithm promises to return consistent communities regardless initial parameters. Experimental results suggest that the proposed method has high scalability to corpus size and feature dimensionality, with more than 15 topical precision improvement compared with popular clustering techniques.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
 |
2
|
Edoardo Airoldi , David Blei , Eric Xing , Stephen Fienberg, A latent mixed membership model for relational data, Proceedings of the 3rd international workshop on Link discovery, p.82-89, August 21-25, 2005, Chicago, Illinois
[doi> 10.1145/1134271.1134283]
|
| |
3
|
|
 |
4
|
Yun Chi , Shenghuo Zhu , Xiaodan Song , Junichi Tatemura , Belle L. Tseng, Structural and temporal analysis of the blogosphere through community factorization, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
[doi> 10.1145/1281192.1281213]
|
| |
5
|
|
| |
6
|
D. A. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS, pages 430--436. MIT Press, 2000.
|
| |
7
|
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
|
 |
8
|
|
 |
9
|
Gary William Flake , Steve Lawrence , C. Lee Giles, Efficient identification of Web communities, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.150-160, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347121]
|
 |
10
|
Bin Gao , Tie-Yan Liu , Xin Zheng , Qian-Sheng Cheng , Wei-Ying Ma, Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
[doi> 10.1145/1081870.1081879]
|
 |
11
|
David Gibson , Jon Kleinberg , Prabhakar Raghavan, Inferring Web communities from link topology, Proceedings of the ninth ACM conference on Hypertext and hypermedia : links, objects, time and space---structure in hypermedia systems: links, objects, time and space---structure in hypermedia systems, p.225-234, June 20-24, 1998, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/276627.276652]
|
| |
12
|
T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl Acad Sci U S A, 101 Suppl 1:5228--5235, April 2004.
|
 |
13
|
|
| |
14
|
T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999.
|
| |
15
|
|
 |
16
|
|
 |
17
|
|
 |
18
|
|
 |
19
|
|
| |
20
|
|
| |
21
|
M. Rosen-Zvi, T. Griffiths, P. Smyth, and M. Steyvers. Learning author topic models from text corpora. Technical report, November 2005.
|
 |
22
|
|
| |
23
|
W.-J. Zhou, J.-R. Wen, W.-Y. Ma, and H.-J. Zhang. A concentric-circle model for community mining in graph structures. Technical Report MSR-TR-2002-123, Microsoft Research Asia, Beijing, China, November 2002.
|
|