ABSTRACT
This article discusses a novel approach to static index pruning that takes into account the locality of word occurrences in the text. We use this approach to propose and evaluate simple and effective pruning methods that allow fast construction of the pruned index. The proposed methods are especially useful for pruning in environments where the document database changes continuously, such as large-scale web search engines. Extensive experiments show that the proposed methods can achieve high compression rates while maintaining the quality of results for the most common query types in modern search engines, namely conjunctive and phrase queries. In the experiments, our locality-based pruning approach reduced search engine indices to 30% of their original size, with almost no reduction in precision at the top answers. Furthermore, we conclude that even an extremely simple locality-based pruning method can be competitive with complex methods that do not rely on locality information.