Clusters, language models, and ad hoc information retrieval
Source
ACM Transactions on Information Systems (TOIS)
Volume 27, Issue 3 (May 2009)
Article No. 13
Year of Publication: 2009
ISSN:1046-8188
Authors
Oren Kurland  Technion—Israel Institute of Technology, Haifa, Israel
Lillian Lee  Cornell University, Ithaca, NY
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1508850.1508851

ABSTRACT

The language-modeling approach to information retrieval provides an effective statistical framework for tackling various problems and often achieves impressive empirical performance. However, most previous work on language models for information retrieval focused on document-specific characteristics, and therefore did not take into account the structure of the surrounding corpus, a potentially rich source of additional information. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in terms of mean average precision (MAP) and recall, and our new interpolation algorithm posts statistically significant performance improvements for both metrics over all six corpora tested. An important aspect of our work is the way we model corpus structure. In contrast to most previous work on cluster-based retrieval that partitions the corpus, we demonstrate the effectiveness of a simple strategy based on a nearest-neighbors approach that produces overlapping clusters.
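The abstract names two ideas without giving formulas: overlapping clusters built from nearest neighbors, and an interpolation of document-based and cluster-based language-model scores. The following is a minimal sketch of how those two ideas could fit together, not the authors' algorithm. Everything specific in it is an assumption made for illustration: cosine similarity over term counts, Dirichlet-style smoothing with parameter `mu`, uniform averaging over the clusters a document belongs to, the mixing weight `lam`, and all function names.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-count vectors (assumed similarity measure).
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_clusters(docs, k):
    # One cluster per document: the document plus its k-1 nearest neighbors.
    # A document can appear in several clusters, so the clusters overlap.
    vecs = [Counter(d) for d in docs]
    clusters = []
    for i, v in enumerate(vecs):
        others = sorted((j for j in range(len(docs)) if j != i),
                        key=lambda j: cosine(v, vecs[j]), reverse=True)
        clusters.append([i] + others[:k - 1])
    return clusters

def interpolated_scores(query, docs, k=2, lam=0.5, mu=10.0):
    # Score each document as a mix of its own query likelihood and the
    # average query likelihood of the clusters that contain it (assumed form).
    corpus = Counter(t for d in docs for t in d)
    total = sum(corpus.values())

    def likelihood(tokens):
        # Query likelihood under a smoothed unigram model of `tokens`.
        counts, n = Counter(tokens), len(tokens)
        return math.prod((counts[t] + mu * corpus[t] / total) / (n + mu)
                         for t in query)

    clusters = knn_clusters(docs, k)
    cluster_lik = [likelihood([t for j in c for t in docs[j]]) for c in clusters]
    scores = []
    for i, d in enumerate(docs):
        member_of = [ci for ci, c in enumerate(clusters) if i in c]
        cluster_part = sum(cluster_lik[ci] for ci in member_of) / len(member_of)
        scores.append(lam * likelihood(d) + (1 - lam) * cluster_part)
    return scores
```

Setting `lam=1.0` in this sketch reduces it to a purely document-based score, which makes it easy to compare the two settings on a toy collection.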

