ACM Home Page
Please provide us with feedback. Feedback
Mining query logs to optimize index partitioning in parallel web search engines
Full text PdfPdf (1.51 MB)
Source ACM International Conference Proceeding Series; Vol. 304 archive
Proceedings of the 2nd international conference on Scalable information systems table of contents
Suzhou, China
SESSION: Data mining II table of contents
Article No. 43  
Year of Publication: 2007
ISBN:978-1-59593-757-5
Authors
Claudio Lucchese  Università Ca' Foscari di Venezia, Venezia, Italy and ISTI-CNR, Pisa, Italy
Salvatore Orlando  Università Ca' Foscari di Venezia, Venezia, Italy
Raffaele Perego  ISTI-CNR, Pisa, Italy
Fabrizio Silvestri  ISTI-CNR, Pisa, Italy
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
SIGIR: ACM Special Interest Group on Information Retrieval
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
Publisher
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 62,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  

ABSTRACT

Large-scale Parallel Web Search Engines (WSEs) needs to adopt a strategy for partitioning the inverted index among a set of parallel server nodes.

In this paper we are interested in devising an effective term-partitioning strategy, according to which the global vocabulary of terms and the associated inverted lists are split into disjoint subsets, and assigned to distinct servers. Due to the workload imbalance caused by the skewed distribution of terms in user queries, finding an effective partitioning strategy is considered a very complex task.

In this paper we first formally introduce Term Partitioning as a new optimization problem. Then we show how the knowledge mined from past WSE query logs can be profitably used to discover good solutions of this problem. Finally, we report many results to show that we are able to effectively reduce both the average number of servers activated per each query, along with the workload imbalance. Experiments are conducted on large query logs of real WSEs.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
 
4
 
5
 
6
 
7
8
 
9
 
10
A. Moffat and J. Zobel. What does it mean to 'measure performance'? In Proc. of the Intl. Conf. on Web Inf. Syst., pages 1--12, Brisbane, Australia, November 2004.
11
 
12
C. Santos-Badue, R. A. Baeza-Yates, B. A. Ribeiro-Neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In Proc. of Intl. Symp. on String Process. and Inf. Retr., pages 10--20, 2001.
 
13
 
14
Y. Xie and D. O'Hallaron. Locality in search engine queries and its implications for caching. In Proc. of IEEE Infocom 2002, New York, USA, June 2002.
 
15


Collaborative Colleagues:
Claudio Lucchese: colleagues
Salvatore Orlando: colleagues
Raffaele Perego: colleagues
Fabrizio Silvestri: colleagues