|
ABSTRACT
The Web search engines maintain large-scale inverted indexes which are queried thousands of times per second by users eager for information. In order to cope with the vast amounts of query loads, search engines prune their index to keep documents that are likely to be returned as top results, and use this pruned index to compute the first batches of results. While this approach can improve performance by reducing the size of the index, if we compute the top results only from the pruned index we may notice a significant degradation in the result quality: if a document should be in the top results but was not included in the pruned index, it will be placed behind the results computed from the pruned index. Given the fierce competition in the online search market, this phenomenon is clearly undesirable. In this paper, we study how we can avoid any degradation of result quality due to the pruning-based performance optimization, while still realizing most of its benefit. Our contribution is a number of modifications in the pruning techniques for creating the pruned index and a new result computation algorithm that guarantees that the top-matching pages are always placed at the top search results, even though we are computing the first batch from the pruned index most of the time. We also show how to determine the optimal size of a pruned index and we experimentally evaluate our algorithms on a collection of 130 million Web pages.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
|
| |
4
|
N. Bruno, L. Gravano, and A. Marian. Evaluating top-k queries over web-accessible databases. In ICDE, 2002.
|
| |
5
|
S. Büttcher and C. L. A. Clarke. A document-centric approach to static index pruning in text retrieval systems. In CIKM, 2006.
|
 |
6
|
|
 |
7
|
David Carmel , Doron Cohen , Ronald Fagin , Eitan Farchi , Michael Herscovici , Yoelle S. Maarek , Aya Soffer, Static index pruning for information retrieval systems, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.43-50, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383958]
|
 |
8
|
|
| |
9
|
|
| |
10
|
Open directory. http://www.dmoz.org.
|
 |
11
|
|
 |
12
|
|
 |
13
|
|
| |
14
|
|
| |
15
|
|
| |
16
|
B. J. Jansen and A. Spink. An analysis of web documents retrieved and viewed. In International Conf. on Internet Computing, 2003.
|
 |
17
|
|
 |
18
|
|
 |
19
|
|
| |
20
|
|
 |
21
|
|
| |
22
|
Looksmart inc. http://www.looksmart.com.
|
 |
23
|
|
 |
24
|
|
 |
25
|
|
| |
26
|
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University.
|
| |
27
|
|
| |
28
|
M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in pagerank. In Advances in Neural Information Processing Systems, 2002.
|
| |
29
|
S. Robertson and K. Spärck-Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129--46, 1976.
|
| |
30
|
|
 |
31
|
Paricia Correia Saraiva , Edleno Silva de Moura , Novio Ziviani , Wagner Meira , Rodrigo Fonseca , Berthier Riberio-Neto, Rank-preserving two-level caching for scalable search engines, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.51-58, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383959]
|
| |
32
|
|
| |
33
|
|
CITED BY 9
|
|
Ricardo Baeza-Yates , Aristides Gionis , Flavio P. Junqueira , Vanessa Murdock , Vassilis Plachouras , Fabrizio Silvestri, Design trade-offs for search engine caching, ACM Transactions on the Web (TWEB), v.2 n.4, p.1-28, October 2008
|
|
|
|
|
|
Fabiano Atalla , Daniel Miranda , Jussara Almeida , Marcos André Gonçalves , Virgilio Almeida, Analyzing the impact of churn and malicious behavior on the quality of peer-to-peer web search, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
|
|
|
Mingjie Zhu , Shuming Shi , Nenghai Yu , Ji-Rong Wen, Can phrase indexing help to process non-phrase queries?, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|