|
ABSTRACT
We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Altingovde, I. S., Can, F., and Ulusoy, Ö. 2006. Algorithms for within-cluster searches using inverted files. In Proceedings of the International Symposium on Computer and Information Sciences (ISCIS). 707--716.
|
 |
2
|
Ismail Sengor Altingovde , Rifat Ozcan , Huseyin Cagdas Ocalan , Fazli Can , Özgür Ulusoy, Large-scale cluster-based retrieval experiments on Turkish texts, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
[doi> 10.1145/1277741.1277961]
|
 |
3
|
|
| |
4
|
|
 |
5
|
|
 |
6
|
|
 |
7
|
Ricardo Baeza-Yates , Aristides Gionis , Flavio Junqueira , Vanessa Murdock , Vassilis Plachouras , Fabrizio Silvestri, The impact of caching on search engines, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
[doi> 10.1145/1277741.1277775]
|
| |
8
|
|
| |
9
|
|
 |
10
|
|
 |
11
|
|
 |
12
|
|
| |
13
|
Cacheda, F. and Baeza-Yates, R. 2004. An optimistic model for searching Web directories. In Proceedings of the 26th European Conference on IR Research (ECIR). 364--377.
|
| |
14
|
Cacheda, F., Carneiro, V., Guerrero, C., and Viña, Á. 2003. Optimization of restricted searches in Web directories using hybrid data structures. In Proceedings of the 25th European Conference on IR Research (ECIR). 436--451.
|
| |
15
|
Cambazoglu, B. B. 2006. Models and algorithms for parallel text retrieval. Ph.D. thesis, Bilkent University, Ankara, Turkey.
|
| |
16
|
|
 |
17
|
|
| |
18
|
|
| |
19
|
|
 |
20
|
|
| |
21
|
Croft, W. B. 1980. A model of cluster searching based on classification. Inf. Syst. 5, 3, 189--195.
|
| |
22
|
|
| |
23
|
|
| |
24
|
Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Inf. Storage Retr. 7, 217--240.
|
| |
25
|
Lester, N., Moffat, A., Webber, W., and Zobel, J. 2005. Space-Limited ranked query evaluation using adaptive pruning. In Proceedings of the 6th International Conference on Web Information Systems Engineering, New York, 470--477.
|
 |
26
|
|
| |
27
|
|
 |
28
|
|
| |
29
|
|
| |
30
|
|
| |
31
|
|
| |
32
|
|
| |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
 |
37
|
|
| |
38
|
Silvestri, F. 2007. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR). 101--112.
|
| |
39
|
Stanford University. 2007. Webbase project Web site. www-diglib.stanford.edu/~testbed/doc2/WebBase.
|
 |
40
|
|
| |
41
|
Trec 2006. Trec. Web site. http://trec.nist.gov.
|
| |
42
|
Tombros, A. 2002. The effectiveness of query-based hierarchic clustering of documents for information retrieval. Ph.D. thesis, University of Glasgow, Glasgow, UK.
|
| |
43
|
|
 |
44
|
|
| |
45
|
|
 |
46
|
|
| |
47
|
|
| |
48
|
|
| |
49
|
Zettair 2007. The Zettair search engine. http://www.seg.rmit.edu.au/zettair/.
|
 |
50
|
|
|