ABSTRACT
The language-modeling approach to information retrieval provides an effective statistical framework for tackling various problems and often achieves impressive empirical performance. However, most previous work on language models for information retrieval focused on document-specific characteristics, and therefore did not take into account the structure of the surrounding corpus, a potentially rich source of additional information. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in terms of mean average precision (MAP) and recall, and our new interpolation algorithm posts statistically significant performance improvements for both metrics over all six corpora tested. An important aspect of our work is the way we model corpus structure. In contrast to most previous work on cluster-based retrieval that partitions the corpus, we demonstrate the effectiveness of a simple strategy based on a nearest-neighbors approach that produces overlapping clusters.
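As a rough illustration of the idea described above, the following sketch builds one overlapping cluster per document from its nearest neighbors and interpolates each document's query likelihood with that of the clusters containing it. It is a minimal sketch under stated assumptions, not the paper's algorithms: cosine similarity over term counts, the averaging over clusters, the background smoothing weight `mu`, and every function and parameter name (`knn_clusters`, `interpolated_scores`, `k`, `lam`) are hypothetical choices made only for this example.

```python
from collections import Counter
from math import prod, sqrt


def mle(tokens):
    """Maximum-likelihood unigram model of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two token lists (assumed similarity measure)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_clusters(docs, k):
    """One cluster per document: the document itself plus its k-1 nearest
    neighbors. Clusters may overlap; the corpus is not partitioned."""
    clusters = []
    for i, d in enumerate(docs):
        ranked = sorted(range(len(docs)),
                        key=lambda j: (j != i, -cosine(d, docs[j])))
        clusters.append(ranked[:k])
    return clusters


def query_likelihood(query, model, background, mu=0.1):
    """Query likelihood under a model smoothed with the corpus model."""
    return prod((1 - mu) * model.get(t, 0.0) + mu * background.get(t, 0.0)
                for t in query)


def interpolated_scores(query, docs, k=2, lam=0.5):
    """Score = lam * document evidence + (1 - lam) * cluster evidence,
    where cluster evidence is averaged over the clusters containing the
    document (an assumption made for this sketch)."""
    background = mle([t for d in docs for t in d])
    clusters = knn_clusters(docs, k)
    cluster_lms = [mle([t for j in c for t in docs[j]]) for c in clusters]
    scores = []
    for i, d in enumerate(docs):
        doc_part = query_likelihood(query, mle(d), background)
        member_of = [ci for ci, c in enumerate(clusters) if i in c]
        cluster_part = sum(
            query_likelihood(query, cluster_lms[ci], background)
            for ci in member_of
        ) / len(member_of)
        scores.append(lam * doc_part + (1 - lam) * cluster_part)
    return scores
```

With `lam=1.0` the sketch reduces to a purely document-based ranking. With `lam < 1`, a document that lacks a query term can still gain score when its neighbors contain that term, which is the kind of corpus-structure evidence the abstract refers to.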