| Mining query logs to optimize index partitioning in parallel web search engines |
| Full text |
Pdf
(1.51 MB)
|
| Source
|
ACM International Conference Proceeding Series; Vol. 304
archive
Proceedings of the 2nd international conference on Scalable information systems
table of contents
Suzhou, China
SESSION: Data mining II
table of contents
Article No. 43
Year of Publication: 2007
ISBN:978-1-59593-757-5
|
|
Authors
|
|
Claudio Lucchese
|
Università Ca' Foscari di Venezia, Venezia, Italy and ISTI-CNR, Pisa, Italy
|
|
Salvatore Orlando
|
Università Ca' Foscari di Venezia, Venezia, Italy
|
|
Raffaele Perego
|
ISTI-CNR, Pisa, Italy
|
|
Fabrizio Silvestri
|
ISTI-CNR, Pisa, Italy
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 3, Downloads (12 Months): 62, Citation Count: 1
|
|
|
ABSTRACT
Large-scale Parallel Web Search Engines (WSEs) needs to adopt a strategy for partitioning the inverted index among a set of parallel server nodes. In this paper we are interested in devising an effective term-partitioning strategy, according to which the global vocabulary of terms and the associated inverted lists are split into disjoint subsets, and assigned to distinct servers. Due to the workload imbalance caused by the skewed distribution of terms in user queries, finding an effective partitioning strategy is considered a very complex task. In this paper we first formally introduce Term Partitioning as a new optimization problem. Then we show how the knowledge mined from past WSE query logs can be profitably used to discover good solutions of this problem. Finally, we report many results to show that we are able to effectively reduce both the average number of servers activated per each query, along with the workload imbalance. Experiments are conducted on large query logs of real WSEs.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
 |
3
|
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
|
 |
8
|
|
| |
9
|
|
| |
10
|
A. Moffat and J. Zobel. What does it mean to 'measure performance'? In Proc. of the Intl. Conf. on Web Inf. Syst., pages 1--12, Brisbane, Australia, November 2004.
|
 |
11
|
Berthier Ribeiro-Neto , Edleno S. Moura , Marden S. Neubert , Nivio Ziviani, Efficient distributed algorithms to build inverted files, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, p.105-112, August 15-19, 1999, Berkeley, California, United States
[doi> 10.1145/312624.312663]
|
| |
12
|
C. Santos-Badue, R. A. Baeza-Yates, B. A. Ribeiro-Neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In Proc. of Intl. Symp. on String Process. and Inf. Retr., pages 10--20, 2001.
|
| |
13
|
|
| |
14
|
Y. Xie and D. O'Hallaron. Locality in search engine queries and its implications for caching. In Proc. of IEEE Infocom 2002, New York, USA, June 2002.
|
| |
15
|
|
CITED BY
|
|
Gleb Skobeltsyn , Toan Luu , Ivana Podnar arko , Martin Rajman , Karl Aberer, Query-driven indexing for scalable peer-to-peer text retrieval, Future Generation Computer Systems, v.25 n.1, p.89-99, January, 2009
|
|