Predicting the readability of short web summaries
Source: Web Search and Web Data Mining
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Barcelona, Spain
SESSION: Web mining II
Pages 202-211  
Year of Publication: 2009
ISBN:978-1-60558-390-7
Authors
Tapas Kanungo  Yahoo! Labs, Santa Clara, CA
David Orr  Yahoo! Labs, Santa Clara, CA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
Google
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Yahoo! Research
Microsoft
Nokia
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1498759.1498827

ABSTRACT

Readability is a crucial presentation attribute that web summarization algorithms consider while generating a query-biased web summary. Readability quality is also an important component in real-time monitoring of commercial search-engine results, since the readability of web summaries affects clickthrough behavior, as shown in recent studies, and thus affects user satisfaction and advertising revenue.

The standard approach to computing readability is to collect a corpus of random queries and their corresponding search result summaries, and then to have a human judge each summary for its readability quality. An average readability score is then reported. This process is time-consuming and expensive. Moreover, manual evaluation cannot be used in the real-time summary generation process. In this paper we propose a machine learning approach to the problem. We use the corpus described above and extract summary features that we think may characterize readability. We then estimate a model (gradient boosted decision tree) that predicts human judgments given the features. This model can then be used in real time to estimate the readability of new (unseen) web search summaries and can also be used in the summary generation process.
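
As a rough illustration of the modeling step described above, the following Python sketch fits a gradient boosted ensemble of one-split regression trees (stumps) to squared error. It is not the authors' implementation: the abstract does not list the features or the training settings, so the three feature columns, the toy judgments, the number of rounds, and the learning rate are all assumptions made up for this example.

```python
from typing import List, Sequence, Tuple

# (feature index, threshold, value if feature <= threshold, value otherwise)
Stump = Tuple[int, float, float, float]
# (base prediction, learning rate, fitted stumps)
Model = Tuple[float, float, List[Stump]]


def fit_stump(X: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Return the one-split regression stump with the lowest squared error."""
    mean = sum(residuals) / len(residuals)
    # Fallback when no split helps: every row goes left and gets the mean.
    best: Stump = (0, float("inf"), mean, mean)
    best_err = sum((r - mean) ** 2 for r in residuals)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = sum((r - lmean) ** 2 for r in left)
            err += sum((r - rmean) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, threshold, lmean, rmean)
    return best


def stump_predict(stump: Stump, row: Sequence[float]) -> float:
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_boosted_stumps(X: Sequence[Sequence[float]], y: Sequence[float],
                       n_rounds: int = 100, learning_rate: float = 0.1) -> Model:
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model: Model, row: Sequence[float]) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


def mean_squared_error(model: Model, X: Sequence[Sequence[float]],
                       y: Sequence[float]) -> float:
    return sum((predict(model, row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


# Toy data, invented for illustration only. Hypothetical feature columns:
# fraction of capitalized words, number of ellipses, average word length.
X = [
    [0.0, 0, 4.5],
    [0.1, 1, 5.0],
    [0.6, 3, 6.5],
    [0.8, 4, 7.0],
    [0.2, 0, 4.8],
    [0.7, 5, 6.9],
]
# Hypothetical editorial readability judgments (higher means more readable).
y = [4.5, 4.0, 2.0, 1.0, 4.2, 1.5]

model = fit_boosted_stumps(X, y)
print("training MSE:", mean_squared_error(model, X, y))
```

Once fitted, `predict(model, features)` costs only one threshold comparison per stump, which is consistent with the abstract's point that a learned model, unlike manual judging, can score unseen summaries in real time.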

We present results on approximately 5000 editorial judgments collected over the course of a year and show examples where the model predicts the quality well and where it disagrees with human judgments. We compare the results of the model to previous models of readability, most notably Collins-Thompson-Callan, Fog and Flesch-Kincaid, and see that our model shows substantially better correlation with editorial judgments as measured by Pearson's correlation coefficient. The learning algorithm also provides us with the relative importance of the features used.
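
To make the comparison above concrete, here is a minimal Python sketch of two of the baseline formulas named in the abstract, the Flesch-Kincaid grade level and the Gunning Fog index, together with Pearson's correlation coefficient. The syllable counter is a crude vowel-group heuristic chosen for brevity, and the tokenization rules are assumptions; the paper's own tooling may differ.

```python
import math
import re
from typing import List, Sequence, Tuple


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text: str) -> Tuple[int, int, List[int]]:
    """Return (sentence count, word count, syllables per word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    # Guard against empty input so the ratios below stay defined.
    return max(1, len(sentences)), max(1, len(words)), syllables


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    n_sent, n_words, syllables = text_stats(text)
    return 0.39 * n_words / n_sent + 11.8 * sum(syllables) / n_words - 15.59


def gunning_fog(text: str) -> float:
    """Gunning Fog: 0.4*((words/sentences) + 100*(complex words/words))."""
    n_sent, n_words, syllables = text_stats(text)
    complex_words = sum(1 for s in syllables if s >= 3)
    return 0.4 * (n_words / n_sent + 100.0 * complex_words / n_words)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy summaries and judgments, invented for illustration only.
summaries = [
    "The cat sat. It was happy.",
    "Comprehensive documentation regarding international regulations ... availability",
    "Find cheap flights and hotels. Book today.",
]
editorial_scores = [4.5, 1.5, 4.0]  # hypothetical human judgments

grades = [flesch_kincaid_grade(s) for s in summaries]
print("Pearson r (FK grade vs. judgments):", pearson(grades, editorial_scores))
```

Both formulas return a difficulty grade, so a higher value means harder text. When either is correlated against judgments where higher means more readable, a negative coefficient would be the expected direction for a baseline that tracks the judgments.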


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1. The R project for statistical computing. http://r-project.org.
3. A. Aula. Enhancing the readability of search result summaries. In Proc. of HCI, 2004.
7. K. Collins-Thompson and J. Callan. A language modeling approach to predicting reading difficulty. In Proceedings of HLT/NAACL, 2004.
8. J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189--1232, 2001. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf.
10. R. Gunning. The technique of clear writing. McGraw-Hill, 1952.
11. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, NY, 2001.
15. M. D. Kickmeier and D. Albert. The effects of scanability on information search: An online experiment. In Proc. of HCI, 2003.
16. J. P. Kincaid, R. P. Fishburn, R. L. Rogers, and B. S. Chissom. Derivation of new readability formulas for navy enlisted personnel. Technical report, Millington, Tenn, Naval Air Station, 1975. Tech Report Research Branch Report 8-75.
17. G. Legge. Psychophysics of Reading in Normal and Low Vision. Lawrence Erlbaum Associates, 2006.
18. P. Li, C. J. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In Proc. 21st Advances in Neural Information Processing Systems, 2007.
19. S. F. Liang, S. Devlin, and J. Tait. Evaluating web search result summaries. In European Conference in IR Research, pages 96--106, 2006.
20. G. H. McLaughlin. SMOG grading: A new readability formula. Journal of Reading, 12:639--646, 1969.
24. K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372--422, 1998.
25. G. Ridgeway. Generalized boosted models: A guide to the gbm package. http://i-pensieri.com/gregr/papers/gbm-vignette.pdf.
26. G. Ridgeway. The state of boosting. Computing Science and Statistics, 31:172--181, 1999. http://www.i-pensieri.com/gregr/papers/interface99.pdf.
28. K. Ryan. Fathom. http://search.cpan.org/dist/Lingua-EN-Fathom.
31. W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, 2002.
33. Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun. A general boosting method and its application to learning ranking functions for web search. In Proc. 21st Advances in Neural Information Processing Systems, 2007.
