Mining multi-faceted overviews of arbitrary topics in a text collection
Source
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 497-505
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Xu Ling, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Qiaozhu Mei, University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Bruce Schatz, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Bibliometrics
Downloads (6 Weeks): 25, Downloads (12 Months): 337, Citation Count: 1
ABSTRACT
A common task in many text mining applications is to generate a multi-faceted overview of a topic in a text collection. Such an overview not only serves directly as an informative summary of the topic, but also supports navigation to its different facets. Existing work has cast this problem as a categorization problem and requires training examples for each facet. This has three limitations: (1) all facets are predefined, which may not fit the needs of a particular user; (2) training examples for each facet are often unavailable; (3) such an approach only works for predefined types of topics. In this paper, we remove these limitations and study a new, more realistic setup of the problem, in which a user can flexibly describe each facet with keywords for an arbitrary topic, and we attempt to mine a multi-faceted overview in an unsupervised way. We take a probabilistic approach to this problem. Experiments on different genres of text data show that our approach can effectively generate a multi-faceted overview for arbitrary topics; the generated overviews are comparable with those generated by supervised methods with training examples, and they are more informative than unstructured flat summaries. The method is quite general and can therefore be applied to multiple text mining tasks in different application domains.
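The problem setup described in the abstract (facets given only as keywords, no training examples) can be illustrated with a toy sketch. This is not the probabilistic model of the paper, whose details the abstract does not give; it is a simple smoothed query-likelihood scorer used as a stand-in. The names `tokenize`, `facet_overview`, and the smoothing weight `mu` are all hypothetical, and the example sentences and facets are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and strip basic punctuation (a deliberately crude tokenizer)."""
    tokens = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [t for t in tokens if t]


def facet_overview(sentences, facets, mu=0.5):
    """Group sentences under keyword-described facets without training data.

    Hypothetical stand-in for the paper's approach: each sentence is scored
    against each facet by the average smoothed log-probability of the facet's
    keywords under the sentence's unigram distribution.
    """
    tokenized = [tokenize(s) for s in sentences]
    # Collection-wide counts serve as the background model for smoothing.
    background = Counter(t for toks in tokenized for t in toks)
    total = sum(background.values())
    vocab = len(background)

    overview = {name: [] for name in facets}
    for sentence, toks in zip(sentences, tokenized):
        counts = Counter(toks)
        best_name, best_score = None, -math.inf
        for name, keywords in facets.items():
            keywords = [k.lower() for k in keywords]
            # Skip facets with no keyword present in the sentence.
            if not any(counts[k] for k in keywords):
                continue
            score = 0.0
            for k in keywords:
                p_bg = (background[k] + 1) / (total + vocab + 1)  # add-one smoothing
                score += math.log((counts[k] + mu * p_bg) / (len(toks) + mu))
            score /= len(keywords)  # average so longer keyword lists are not penalized
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None:
            overview[best_name].append(sentence)
    return overview


if __name__ == "__main__":
    docs = [
        "The gene is expressed in the embryo during early development.",
        "Mutant flies show a wing phenotype.",
        "The protein interacts with other proteins.",
    ]
    user_facets = {
        "expression": ["expressed", "embryo"],
        "phenotype": ["mutant", "phenotype"],
    }
    for facet, matched in facet_overview(docs, user_facets).items():
        print(facet, matched)
```

Sentences that contain none of a facet's keywords are left out of the overview in this sketch; a real system would presumably need a softer notion of relevance than exact keyword matches.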
CITED BY
Munmun De Choudhury, Hari Sundaram, Ajita John, Dorée Duncan Seligmann, What makes conversations interesting?: themes, participants and consequences of conversations in online social media, Proceedings of the 18th international conference on World wide web, April 20-24, 2009, Madrid, Spain