ACM Home Page
Please provide us with feedback. Feedback
Heavy-tailed distributions and multi-keyword queries
Full text PdfPdf (504 KB)
Source
Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval table of contents
Amsterdam, The Netherlands
SESSION: Query processing strategies table of contents
Pages: 663 - 670  
Year of Publication: 2007
ISBN:978-1-59593-597-7
Authors
Surajit Chaudhuri  Microsoft Corporation, Redmond, WA
Kenneth Church  Microsoft Corporation, Redmond, WA
Arnd Christian König  Microsoft Corporation, Redmond, WA
Liying Sui  Microsoft Corporation, Redmond, WA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 122,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1277741.1277855
What is a DOI?

ABSTRACT

Intersecting inverted indexes is a fundamental operation for many applications in information retrieval and databases. Efficient indexing for this operation is known to be a hard problem for arbitrary data distributions. However, text corpora used in Information Retrieval applications often have convenient power-law constraints (also known as Zipf's Law and long tails) that allow us to materialize carefully chosen combinations of multi-keyword indexes, which significantly improve worst-case performance without requiring excessive storage. These multi-keyword indexes limit the number of postings accessed when computing arbitrary index intersections. Our evaluation on an e-commerce collection of 20 million products shows that the indexes of up to four arbitrary keywords can be intersected while accessing less than 20% of the postings in the largest single-keyword index.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
Polarity dataset v2.0. http://www.cs.cornell.edu/people/pabo/movie-review-data/.
3
 
4
H. Baayen. Word Frequency Distributions, volume 18 of Text, Speech and Language Technology. Kulver Academic Publishers, 2001.
5
 
6
 
7
8
 
9
P. Li, T. Hastie, and K. W. Church. Using Sketches to Estimate Two-Way and Multi-Way Associations. Technical report, Microsoft Research, 2006.
 
10
P. Li, K. W. Church and T. Hastie. Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data. NIPS, 2006.
 
11
12
 
13
 
14
Yahoo. Analysts day report. available at http://yhoo.client.shareholder.com/downloads/2006AnalystsDay.pdf, 2006.

Collaborative Colleagues:
Surajit Chaudhuri: colleagues
Kenneth Church: colleagues
Arnd Christian König: colleagues
Liying Sui: colleagues