ABSTRACT
Topic detection and tracking, together with topic segmentation, play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), a combination of MI and term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach detects shared topics among documents, finds the optimal boundaries within a document, and aligns segments across documents at the same time. It also handles single-document segmentation as a special case of multi-document segmentation and alignment. Using term weights based on entropy learned from multiple documents, our methods identify and strengthen cue terms that are useful for segmentation and partially remove stop words. Our experimental results show that our algorithm works well for single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents substantially improves the performance of topic segmentation, and WMI outperforms MI for multi-document segmentation.
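The idea that a good segmentation maximizes MI can be illustrated with a small sketch. The code below is not the algorithm of the paper: it assumes a single document, exactly one boundary, and plain unweighted MI between terms and segments estimated from raw counts. The function names `mutual_information` and `best_single_boundary` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def mutual_information(segments):
    """Estimate I(T; S) between terms T and segments S from raw term counts.

    `segments` is a list of token lists, one per segment (an assumed input format).
    """
    counts = [Counter(seg) for seg in segments]
    total = sum(sum(c.values()) for c in counts)
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)

    mi = 0.0
    for c in counts:
        p_s = sum(c.values()) / total          # p(segment)
        for term, n in c.items():
            p_ts = n / total                   # p(term, segment)
            p_t = term_totals[term] / total    # p(term)
            mi += p_ts * math.log(p_ts / (p_t * p_s))
    return mi


def best_single_boundary(sentences):
    """Try every two-way split of the sentence list and keep the split with highest MI.

    Returns (boundary_index, mi_score); the boundary falls before sentences[boundary_index].
    """
    best = None
    for b in range(1, len(sentences)):
        left = [w for s in sentences[:b] for w in s]
        right = [w for s in sentences[b:] for w in s]
        score = mutual_information([left, right])
        if best is None or score > best[1]:
            best = (b, score)
    return best


# Toy document: two sentences about animals followed by two about finance.
sentences = [["cat", "dog"], ["dog", "cat"], ["stock", "bond"], ["bond", "stock"]]
boundary, score = best_single_boundary(sentences)
```

On this toy input the highest-scoring split separates the two vocabularies. A weighted variant in the spirit of WMI would multiply each summand by a term weight, but the exact weighting scheme is not given in this abstract, so it is left out here.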