ABSTRACT
We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ".", "?", and "!" as either a valid or an invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than that of similar systems, but we emphasize the simplicity of retraining for new domains.
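As a rough illustration of the setup described in the abstract, the sketch below turns a boundary-annotated corpus into labeled training examples, one per occurrence of ".", "?", or "!". It is a minimal sketch under stated assumptions, not the authors' implementation: the input format (one annotated sentence per string), the function name `labeled_candidates`, and the choice of prefix and suffix as context are all hypothetical. The classifier that would be trained on these examples is left out.

```python
CANDIDATES = ".?!"  # the three punctuation marks treated as potential boundaries


def labeled_candidates(sentences):
    """Return (prefix, mark, suffix, is_boundary) for every candidate mark.

    `sentences` is a list of strings, one annotated sentence each
    (an assumed input format). A mark is labeled as a boundary only if
    it is the last candidate mark in the final whitespace-delimited
    token of its sentence; every other mark is labeled as a non-boundary.
    """
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for t_idx, token in enumerate(tokens):
            # Character positions of all candidate marks inside this token.
            positions = [i for i, ch in enumerate(token) if ch in CANDIDATES]
            for i in positions:
                is_last_token = t_idx == len(tokens) - 1
                is_boundary = is_last_token and i == positions[-1]
                # Prefix and suffix are the parts of the token on either
                # side of the mark (hypothetical context features).
                examples.append((token[:i], token[i], token[i + 1:], is_boundary))
    return examples


corpus = ["Mr. Smith paid $3.50.", "Really?"]
examples = labeled_candidates(corpus)
```

Here the period in "Mr." and the first period in "$3.50." come out as non-boundaries, while the final period and the question mark come out as boundaries. No rules or word lists are involved: the labels come only from the annotation, which is what makes retraining on a new genre straightforward.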