Topic segmentation with shared topic detection and alignment of multiple documents
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Amsterdam, The Netherlands
SESSION: Topic detection and tracking
Pages: 199-206
Year of Publication: 2007
ISBN: 978-1-59593-597-7
Authors
Bingjun Sun, The Pennsylvania State University
Prasenjit Mitra, The Pennsylvania State University
C. Lee Giles, The Pennsylvania State University
John Yen, The Pennsylvania State University
Hongyuan Zha, The Georgia Institute of Technology
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1277741.1277778

ABSTRACT

Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), which combines MI with term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents, find the optimal boundaries in a document, and align segments among documents at the same time. It can also handle single-document segmentation as a special case of multi-document segmentation and alignment. By using term weights based on entropy learned from multiple documents, our methods can identify and strengthen cue terms that are useful for segmentation and partially remove stop words. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and using WMI is even better than using MI for multi-document segmentation.
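
The idea that the optimal segmentation maximizes MI can be illustrated with a minimal sketch. This is not the algorithm from the paper: it assumes a single document, exactly one boundary, and the standard mutual information I(T; S) between terms T and segments S estimated from raw term counts. The function names `mutual_information` and `best_boundary` and the toy sentences are hypothetical, chosen only for illustration, and term weighting (WMI) is left out.

```python
import math
from collections import Counter


def mutual_information(segments):
    """Estimate I(T; S) in bits from term counts per segment.

    `segments` is a list of segments; each segment is a list of
    sentences; each sentence is a list of tokens. This estimator is an
    assumption of the sketch, not the paper's exact formulation.
    """
    # Term counts inside each segment give the joint counts n(t, s).
    joint = [Counter(tok for sent in seg for tok in sent) for seg in segments]
    total = sum(sum(counts.values()) for counts in joint)

    # Marginal term counts n(t) across all segments.
    term_totals = Counter()
    for counts in joint:
        term_totals.update(counts)

    mi = 0.0
    for counts in joint:
        p_s = sum(counts.values()) / total
        for term, n in counts.items():
            p_ts = n / total
            p_t = term_totals[term] / total
            mi += p_ts * math.log2(p_ts / (p_t * p_s))
    return mi


def best_boundary(sentences):
    """Try every single split point and keep the one with the highest MI.

    Returns (b, scores), where the split is sentences[:b] | sentences[b:].
    """
    scores = {
        b: mutual_information([sentences[:b], sentences[b:]])
        for b in range(1, len(sentences))
    }
    return max(scores, key=scores.get), scores


# Toy document: two sentences about pets followed by two about finance.
toy = [
    ["cat", "dog"],
    ["dog", "cat"],
    ["stock", "market"],
    ["market", "stock"],
]
boundary, scores = best_boundary(toy)
print(boundary, scores)
```

The number of segments is held fixed at two here because MI computed this way does not decrease when a segment is split further, so comparing segmentations with different numbers of segments would need an additional criterion that this sketch does not attempt to model.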

