ABSTRACT
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then to rank documents by the likelihood of the query according to the estimated language model. A central issue in language model estimation is smoothing, the problem of adjusting the maximum likelihood estimator to compensate for data sparseness. In this article, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections. Experimental results show that not only is the retrieval performance generally sensitive to the smoothing parameters, but also the sensitivity pattern is affected by the query type, with performance being more sensitive to smoothing for verbose queries than for keyword queries. Verbose queries also generally require more aggressive smoothing to achieve optimal performance. This suggests that smoothing plays two different roles: to make the estimated document language model more accurate and to "explain" the noninformative words in the query. In order to decouple these two distinct roles of smoothing, we propose a two-stage smoothing strategy, which yields better sensitivity patterns and facilitates the automatic setting of smoothing parameters. We further propose methods for estimating the smoothing parameters automatically. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to, or better than, the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
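The abstract describes query-likelihood ranking and two-stage smoothing only in words. The sketch below illustrates the idea under stated assumptions; it is not the authors' implementation. It assumes the commonly cited two-stage form, in which a Dirichlet prior estimate with parameter `mu` is interpolated with a background model using coefficient `lam`, and it assumes the collection model stands in for the query background model. The function names, the toy documents, and the default values of `mu` and `lam` are hypothetical.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum likelihood unigram model p(w|C) over all documents."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def two_stage_prob(w, doc_counts, doc_len, p_c, mu, lam):
    """Assumed two-stage estimate of p(w|d).

    Stage 1: Dirichlet prior smoothing of the document model (mu).
    Stage 2: interpolation with a background model (lam); the collection
    model is used as that background here, which is an assumption.
    lam = 0 reduces to Dirichlet prior smoothing; mu = 0 reduces to
    Jelinek-Mercer smoothing (requires a non-empty document).
    """
    p_bg = p_c.get(w, 0.0)
    dirichlet = (doc_counts.get(w, 0) + mu * p_bg) / (doc_len + mu)
    return (1.0 - lam) * dirichlet + lam * p_bg


def query_log_likelihood(query, doc, p_c, mu=1000.0, lam=0.1):
    """Log-likelihood of the query under the smoothed document model."""
    counts = Counter(doc)
    score = 0.0
    for w in query:
        if w not in p_c:
            # Words unseen in the whole collection are skipped, since
            # they cannot distinguish between documents.
            continue
        score += math.log(two_stage_prob(w, counts, len(doc), p_c, mu, lam))
    return score


def rank(query, docs, mu=1000.0, lam=0.1):
    """Return document indices ordered by decreasing query likelihood."""
    p_c = collection_model(docs)
    scored = [
        (query_log_likelihood(query, d, p_c, mu, lam), i)
        for i, d in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


docs = [
    ["language", "model", "smoothing", "retrieval"],
    ["speech", "recognition", "language"],
    ["cooking", "recipes", "pasta"],
]
ranking = rank(["smoothing", "retrieval"], docs)
```

With the toy collection above, `ranking` should place the first document on top, because it is the only one containing the query words. Varying `mu` and `lam` separately is one way to see the two roles of smoothing that the abstract distinguishes: `mu` controls how much the document estimate leans on the collection, while `lam` controls how much query probability mass is attributed to background words.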
INDEX TERMS
Primary Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.3 Information Search and Retrieval
Subjects: Retrieval models
Additional Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.4 Systems and Software
Subjects: Performance evaluation (efficiency and effectiveness)
General Terms: Algorithms, Experimentation, Performance
Keywords: Dirichlet prior smoothing, EM algorithm, Jelinek--Mercer smoothing, Statistical language models, TF-IDF weighting, absolute discounting smoothing, backoff smoothing, interpolation smoothing, leave-one-out, risk minimization, term weighting, two-stage smoothing