ABSTRACT
This paper presents a novel domain-independent text segmentation method that identifies topic-change boundaries in long text documents and text streams. The method has three components. First, as a preprocessing step, we remove document-dependent stop words as well as generic stop words before computing sentence similarity, which helps discriminate the semantic information carried by each sentence. Second, we capture the cohesion information of the sentences in a document or text stream in a sentence-distance matrix, in which each entry corresponds to the similarity between a pair of sentences. Because the distance matrix can be represented as a gray-scale image, the text segmentation problem is converted into an image segmentation problem; we apply anisotropic diffusion to this image representation to enhance the semantic cohesion of sentence topical groups and to sharpen topical boundaries. Third, we adapt dynamic programming to find the optimal topical boundaries and to provide a zoom-in and zoom-out mechanism for topic access by segmenting the text into a variable number of sentence topical groups. Our approach requires no domain-specific training and can be applied to texts in a variety of domains. Experimental results show that our approach is effective in text segmentation and outperforms several state-of-the-art methods.
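The three stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity measure, the diffusion scheme, or the dynamic programming objective, so the sketch assumes cosine similarity over word counts, a Perona-Malik style diffusion, and a within-segment dissimilarity cost. All function names, parameter values, the stop-word list, and the rule that treats words occurring in most sentences as document-dependent stop words are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative list; a real system would use a full generic stop-word list.
GENERIC_STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are",
                      "in", "to", "on", "when", "they"}

def tokenize(sentence, stop_words):
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in stop_words]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def similarity_matrix(sentences, doc_freq_cutoff=0.8):
    # Assumption: a word is a document-dependent stop word when it occurs
    # in more than `doc_freq_cutoff` of the sentences.
    seen = [set(tokenize(s, GENERIC_STOP_WORDS)) for s in sentences]
    df = Counter(w for words in seen for w in words)
    doc_stop = {w for w, c in df.items() if c / len(sentences) > doc_freq_cutoff}
    vecs = [Counter(tokenize(s, GENERIC_STOP_WORDS | doc_stop)) for s in sentences]
    return [[cosine(u, v) for v in vecs] for u in vecs]

def anisotropic_diffusion(img, iterations=10, kappa=0.3, step=0.2):
    # Perona-Malik style smoothing (assumed scheme): small gradients are
    # smoothed, large gradients (candidate topic boundaries) are preserved.
    n = len(img)
    img = [row[:] for row in img]
    for _ in range(iterations):
        new = [row[:] for row in img]
        for i in range(n):
            for j in range(n):
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n and 0 <= nj < n:
                        grad = img[ni][nj] - img[i][j]
                        new[i][j] += step * math.exp(-(grad / kappa) ** 2) * grad
        img = new
    return img

def segment(sim, k):
    # Dynamic programming over k contiguous segments (k <= number of
    # sentences), minimizing the summed dissimilarity inside each segment
    # (assumed objective).
    n = len(sim)

    def cost(a, b):
        return sum(1.0 - sim[i][j] for i in range(a, b) for j in range(a, b))

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for end in range(s, n + 1):
            for start in range(s - 1, end):
                c = best[s - 1][start] + cost(start, end)
                if c < best[s][end]:
                    best[s][end], back[s][end] = c, start
    starts, end = [], n
    for s in range(k, 0, -1):
        end = back[s][end]
        starts.append(end)
    return sorted(starts)[1:]  # drop the leading 0; keep interior boundaries

SENTENCES = [
    "Cats purr when they are happy.",
    "Cats and kittens purr softly.",
    "Happy kittens purr and play.",
    "Stock markets fell sharply today.",
    "Markets and stock prices fell.",
    "Stock prices fell again today.",
]
smoothed = anisotropic_diffusion(similarity_matrix(SENTENCES))
boundaries = segment(smoothed, 2)  # indices of sentences that start a new topic
print(boundaries)
```

Calling `segment` with a different `k` on the same smoothed matrix gives a coarser or finer segmentation, which is one simple way to picture the zoom-in and zoom-out mechanism mentioned above.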