| Heavy-tailed distributions and multi-keyword queries |
| Full text |
Pdf
(504 KB)
|
Source
|
Annual ACM Conference on Research and Development in Information Retrieval
archive
Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
table of contents
Amsterdam, The Netherlands
SESSION: Query processing strategies
table of contents
Pages: 663 - 670
Year of Publication: 2007
ISBN:978-1-59593-597-7
|
|
Authors
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 10, Downloads (12 Months): 122, Citation Count: 0
|
|
|
ABSTRACT
Intersecting inverted indexes is a fundamental operation for many applications in information retrieval and databases. Efficient indexing for this operation is known to be a hard problem for arbitrary data distributions. However, text corpora used in Information Retrieval applications often have convenient power-law constraints (also known as Zipf's Law and long tails) that allow us to materialize carefully chosen combinations of multi-keyword indexes, which significantly improve worst-case performance without requiring excessive storage. These multi-keyword indexes limit the number of postings accessed when computing arbitrary index intersections. Our evaluation on an e-commerce collection of 20 million products shows that the indexes of up to four arbitrary keywords can be intersected while accessing less than 20% of the postings in the largest single-keyword index.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Polarity dataset v2.0. http://www.cs.cornell.edu/people/pabo/movie-review-data/.
|
 |
3
|
|
| |
4
|
H. Baayen. Word Frequency Distributions, volume 18 of Text, Speech and Language Technology. Kulver Academic Publishers, 2001.
|
 |
5
|
|
| |
6
|
|
| |
7
|
|
 |
8
|
T. S. Jayram , Subhash Khot , Ravi Kumar , Yuval Rabani, Cell-probe lower bounds for the partial match problem, Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, June 09-11, 2003, San Diego, CA, USA
[doi> 10.1145/780542.780639]
|
| |
9
|
P. Li, T. Hastie, and K. W. Church. Using Sketches to Estimate Two-Way and Multi-Way Associations. Technical report, Microsoft Research, 2006.
|
| |
10
|
P. Li, K. W. Church and T. Hastie. Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data. NIPS, 2006.
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
| |
14
|
Yahoo. Analysts day report. available at http://yhoo.client.shareholder.com/downloads/2006AnalystsDay.pdf, 2006.
|
|