| Inverted index compression and query processing with optimized document ordering |
| Full text |
Pdf
(1.11 MB)
|
Source
|
International World Wide Web Conference
archive
Proceedings of the 18th international conference on World wide web
table of contents
Madrid, Spain
SESSION: Search/session: query processing
table of contents
Pages 401-410
Year of Publication: 2009
ISBN:978-1-60558-487-4
|
|
Authors
|
|
Hao Yan
|
Polytechnic Institute of NYU, Brooklyn, NY, USA
|
|
Shuai Ding
|
Polytechnic Institute of NYU, Brooklyn, NY, USA
|
|
Torsten Suel
|
Yahoo! Research, Sunnyvale, CA, USA
|
|
| Sponsor |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 41, Downloads (12 Months): 158, Citation Count: 0
|
|
|
ABSTRACT
Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
 |
4
|
|
 |
5
|
|
| |
6
|
R. Blanco and A. Barreiro. Document identifier reassignment through dimensionality reduction. In Proc. of the 27th European Conf. on Information Retrieval, pages 375--387, 2005.
|
| |
7
|
|
| |
8
|
P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In Proc. of the 12th Int. Conf. on String Processing and Information Retrieval, 2005.
|
| |
9
|
|
 |
10
|
Andrei Z. Broder , David Carmel , Michael Herscovici , Aya Soffer , Jason Zien, Efficient query evaluation using a two-level retrieval process, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
[doi> 10.1145/956863.956944]
|
| |
11
|
A. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. Shekita. Indexing shared content in information retrieval systems. In Proc. of the 10th Int. Conf. on Extending Database Technology, pages 313--330, 2006.
|
 |
12
|
Flavio Chierichetti , Silvio Lattanzi , Federico Mari , Alessandro Panconesi, On placing skips optimally in expectation, Proceedings of the international conference on Web search and web data mining, February 11-12, 2008, Palo Alto, California, USA
[doi> 10.1145/1341531.1341537]
|
 |
13
|
|
| |
14
|
S. Heman. Super-scalar database compression between RAM and CPU-cache. MS Thesis, Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands, July 2005.
|
| |
15
|
M. Herscovici, R. Lempel, and S. Yogev. Efficient indexing of versioned document sequences. In Proc. of the 29th European Conf. on Information Retrieval, 2007.
|
| |
16
|
|
| |
17
|
|
 |
18
|
|
| |
19
|
|
 |
20
|
|
| |
21
|
|
 |
22
|
|
| |
23
|
|
| |
24
|
F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conf. on Information Retrieval, pages 101--112, 2007.
|
 |
25
|
|
| |
26
|
|
 |
27
|
|
 |
28
|
|
 |
29
|
|
| |
30
|
|
|