A maximum entropy approach to identifying sentence boundaries
Source: Applied Natural Language Conferences
Proceedings of the fifth conference on Applied natural language processing
Washington, DC
Pages: 16-19
Year of Publication: 1997
Authors
Jeffrey C. Reynar  University of Pennsylvania, Philadelphia, Pennsylvania
Adwait Ratnaparkhi  University of Pennsylvania, Philadelphia, Pennsylvania
Publisher
Association for Computational Linguistics  Morristown, NJ, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 27,   Citation Count: 52
DOI Bookmark: 10.3115/974557.974561

ABSTRACT

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.
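
The classification setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the features or the estimation procedure, so the four context features, the toy training sentences, and the gradient-ascent training loop below are all assumptions made for illustration. The sketch fits a two-outcome log-linear model (binary logistic regression, which is the two-class case of a maximum entropy classifier) to every occurrence of ., ?, and ! in a boundary-annotated corpus, then uses it to split new text.

```python
import math
import re

# Toy boundary-annotated corpus, one sentence per string (invented for this sketch).
TRAIN = [
    "Dr. Smith arrived at 9 a.m. on Monday.",
    "Did Mr. Jones call?",
    "The price rose 3.5 percent.",
    "Stop!",
    "Mrs. Lee met Dr. Brown.",
    "It costs $4.25 now.",
]


def candidates(text):
    """Offsets of every '.', '?' and '!' in text: the potential boundaries."""
    return [m.start() for m in re.finditer(r"[.?!]", text)]


def features(text, i):
    """Hypothetical context features for the candidate mark at offset i."""
    prefix = re.search(r"\S*$", text[:i]).group(0).lower()  # token text before the mark
    suffix, nxt = re.match(r"(\S*)\s*(\S*)", text[i + 1:]).groups()
    return [
        "mark=" + text[i],
        "prefix=" + prefix,
        "space_after=" + str(suffix == ""),       # mark ends its token
        "next_cap=" + str(nxt[:1].isupper()),     # following token is capitalized
    ]


def labeled_examples(sentences):
    """Join sentences with spaces and label each candidate mark True/False."""
    text = " ".join(sentences)
    gold, pos = set(), 0
    for s in sentences:
        pos += len(s)
        gold.add(pos - 1)  # offset of the sentence-final character
        pos += 1           # the joining space
    return [(features(text, i), i in gold) for i in candidates(text)]


def score(weights, feats):
    """Log-odds that a candidate is a boundary (sum of active feature weights)."""
    return sum(weights.get(f, 0.0) for f in feats)


def train(examples, epochs=200, rate=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, is_boundary in examples:
            z = max(-30.0, min(30.0, score(weights, feats)))  # avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            step = rate * ((1.0 if is_boundary else 0.0) - p)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + step
    return weights


def split(text, weights):
    """Cut text after every candidate mark the model scores as a boundary."""
    sentences, start = [], 0
    for i in candidates(text):
        if score(weights, features(text, i)) > 0:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    weights = train(labeled_examples(TRAIN))
    print(split("Dr. Brown met Mr. Jones on Monday. Did it rise 3.5 percent?", weights))
```

With six training sentences the model mostly memorizes the token that precedes each mark, so it should not be expected to generalize to unseen abbreviations. The point of the sketch is the workflow the abstract emphasizes: retraining for a new domain means supplying a different boundary-annotated corpus and calling `train` again, with no rules or lexica to edit.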

