Domain-independent text segmentation using anisotropic diffusion and dynamic programming
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Toronto, Canada
SESSION: Novelty and topic change
Pages: 322 - 329  
Year of Publication: 2003
ISBN:1-58113-646-3
Authors
Xiang Ji  The Pennsylvania State University, University Park, PA
Hongyuan Zha  The Pennsylvania State University, University Park, PA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 76,   Citation Count: 9
DOI: http://doi.acm.org/10.1145/860435.860494

ABSTRACT

This paper presents a novel domain-independent text segmentation method, which identifies the boundaries of topic changes in long text documents and/or text streams. The method consists of three components. First, as a preprocessing step, we eliminate document-dependent stop words as well as generic stop words before the sentence similarity is computed. This step helps discriminate the semantic information of the sentences. Second, the cohesion information of sentences in a document or a text stream is captured in a sentence-distance matrix, in which each entry corresponds to the similarity between a pair of sentences. The distance matrix can be represented as a gray-scale image, so the text segmentation problem is converted into an image segmentation problem. We apply the anisotropic diffusion technique to the image representation of the distance matrix to enhance the semantic cohesion of sentence topical groups and to sharpen topical boundaries. Finally, the dynamic programming technique is adapted to find the optimal topical boundaries and to provide a zoom-in and zoom-out mechanism for topic access by segmenting the text into variable numbers of sentence topical groups. Our approach involves no domain-specific training, and it can be applied to texts in a variety of domains. The experimental results show that our approach is effective in text segmentation and outperforms several state-of-the-art methods.
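The three components described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give their formulas, so several pieces are assumptions: the tiny `GENERIC_STOP_WORDS` list, the `max_doc_fraction` heuristic for document-dependent stop words, cosine similarity over word counts, a Perona-Malik style diffusion step with hypothetical parameters `steps`, `kappa`, and `lam`, and a dynamic-programming objective that maximizes the length-normalized similarity inside each segment.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny generic stop list (not the paper's list).
GENERIC_STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def sentence_vectors(sentences, max_doc_fraction=0.8):
    """Bag-of-words vectors without generic or document-dependent stop words."""
    bags = [Counter(w for w in s.lower().split() if w not in GENERIC_STOP_WORDS)
            for s in sentences]
    # Assumed heuristic: a word present in most sentences is document-dependent.
    doc_freq = Counter(w for bag in bags for w in bag)
    limit = max_doc_fraction * len(sentences)
    return [Counter({w: c for w, c in bag.items() if doc_freq[w] <= limit})
            for bag in bags]


def cosine(u, v):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(c * v[w] for w, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values()) * sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """Square matrix whose entry (i, j) is the similarity of sentences i and j."""
    vecs = sentence_vectors(sentences)
    return [[cosine(u, v) for v in vecs] for u in vecs]


def diffuse(img, steps=5, kappa=0.3, lam=0.2):
    """Perona-Malik style diffusion: smooth flat regions, keep sharp edges."""
    n = len(img)
    for _ in range(steps):
        new = [row[:] for row in img]
        for i in range(n):
            for j in range(n):
                flow = 0.0
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if 0 <= ni < n and 0 <= nj < n:
                        grad = img[ni][nj] - img[i][j]
                        # Conduction shrinks as the local gradient grows.
                        flow += math.exp(-(grad / kappa) ** 2) * grad
                new[i][j] = img[i][j] + lam * flow
        img = new
    return img


def segment(sim, k):
    """Split sentences 0..n-1 into k contiguous groups; return the cut indices."""
    n = len(sim)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of sentences")
    # 2-D prefix sums give the total similarity of any diagonal block in O(1).
    pre = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            pre[i + 1][j + 1] = sim[i][j] + pre[i][j + 1] + pre[i + 1][j] - pre[i][j]

    def score(a, b):  # assumed objective: block similarity divided by length
        return (pre[b][b] - pre[a][b] - pre[b][a] + pre[a][a]) / (b - a)

    neg = float("-inf")
    best = [[neg] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):          # number of groups used so far
        for b in range(m, n + 1):      # exclusive end of the m-th group
            for a in range(m - 1, b):  # start of the m-th group
                if best[m - 1][a] == neg:
                    continue
                cand = best[m - 1][a] + score(a, b)
                if cand > best[m][b]:
                    best[m][b], back[m][b] = cand, a
    cuts, b = [], n
    for m in range(k, 0, -1):          # walk the back-pointers from the end
        b = back[m][b]
        cuts.append(b)
    return cuts[::-1][1:]              # drop the leading 0
```

Calling `segment(diffuse(similarity_matrix(sentences)), k)` with increasing values of `k` yields finer segmentations of the same text, which is one simple way to mimic the zoom-in and zoom-out access the abstract mentions. Keeping `lam` at or below 0.25 keeps this explicit four-neighbor update numerically stable.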

