ABSTRACT
We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on news articles, web pages are quite different in both structure and content. Instead of coherent text with a well-defined discourse structure, they are more likely to be a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to generate a summary of a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by “translating” it into a more concise representation rather than attempting to extract a text span verbatim. OCELOT uses probabilistic models to guide it in selecting words and ordering them into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.