ACM Home Page
Please provide us with feedback. Feedback
Inverted index compression and query processing with optimized document ordering
Full text PdfPdf (1.11 MB)
Source
International World Wide Web Conference archive
Proceedings of the 18th international conference on World wide web table of contents
Madrid, Spain
SESSION: Search/session: query processing table of contents
Pages 401-410  
Year of Publication: 2009
ISBN:978-1-60558-487-4
Authors
Hao Yan  Polytechnic Institute of NYU, Brooklyn, NY, USA
Shuai Ding  Polytechnic Institute of NYU, Brooklyn, NY, USA
Torsten Suel  Yahoo! Research, Sunnyvale, CA, USA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 41,   Downloads (12 Months): 158,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1526709.1526764
What is a DOI?

ABSTRACT

Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
4
5
 
6
R. Blanco and A. Barreiro. Document identifier reassignment through dimensionality reduction. In Proc. of the 27th European Conf. on Information Retrieval, pages 375--387, 2005.
 
7
 
8
P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In Proc. of the 12th Int. Conf. on String Processing and Information Retrieval, 2005.
 
9
10
 
11
A. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. Shekita. Indexing shared content in information retrieval systems. In Proc. of the 10th Int. Conf. on Extending Database Technology, pages 313--330, 2006.
12
13
 
14
S. Heman. Super-scalar database compression between RAM and CPU-cache. MS Thesis, Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands, July 2005.
 
15
M. Herscovici, R. Lempel, and S. Yogev. Efficient indexing of versioned document sequences. In Proc. of the 29th European Conf. on Information Retrieval, 2007.
 
16
 
17
18
 
19
20
 
21
22
 
23
 
24
F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conf. on Information Retrieval, pages 101--112, 2007.
25
 
26
27
28
29
 
30

Collaborative Colleagues:
Hao Yan: colleagues
Shuai Ding: colleagues
Torsten Suel: colleagues