| Compressing term positions in web indexes |
| Full text |
Pdf
(623 KB)
|
Source
|
Annual ACM Conference on Research and Development in Information Retrieval
archive
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
table of contents
Boston, MA, USA
SESSION: Efficiency
table of contents
Pages 147-154
Year of Publication: 2009
ISBN:978-1-60558-483-6
|
|
Authors
|
|
Hao Yan
|
Polytechnic Institute of NYU, Brooklyn, NY, USA
|
|
Shuai Ding
|
Polytechnic Institute of NYU, Brooklyn, NY, USA
|
|
Torsten Suel
|
Polytechnic Institute of NYU, Brooklyn, NY, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 81, Downloads (12 Months): 183, Citation Count: 0
|
|
|
ABSTRACT
Large search engines process thousands of queries per second on billions of pages, making query processing a major factor in their operating costs. This has led to a lot of research on how to improve query throughput, using techniques such as massive parallelism, caching, early termination, and inverted index compression. We focus on techniques for compressing term positions in web search engine indexes. Most previous work has focused on compressing docID and frequency data, or position information in other types of text collections. Compression of term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering effects. Also, typical access patterns for position data are different from those for docID and frequency data. We perform a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also study how to efficiently access position data for ranking functions that take proximity features into account.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
V. Anh. Impact-based document retrieval. PhD Thesis, The University of Melbourne, Amsterdam, Netherlands, April 2004.
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
|
 |
6
|
|
| |
7
|
A. Bookstein, S. Klein, and T. Raita. Markov models for clusters in concordance compression. In Proc. of the Data Compression Conference, March, 1994.
|
 |
8
|
|
| |
9
|
P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 12(3):194--203, Mar. 1975.
|
| |
10
|
A. Fraenkel and S. Klein. Novel compression of sparse bit-strings--preliminary report. Combinatorial Algorithms on Words, 12:169--183, 1985.
|
| |
11
|
J. Gailly. zlib compression library. Available at http://www.gzip.org/zlib/.
|
| |
12
|
S. Golomb. Run-length encodingds. IEEE Transactions on Information Theory, 12(3):399--401, 1966.
|
| |
13
|
S. Heman. Super-scalar database compression between ram and cpu-cache. MS Thesis, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, Netherlands, July 2005.
|
 |
14
|
|
 |
15
|
|
| |
16
|
G. Mishne and M. Rijke. Boosting web retrieval through query operations. In Proc. of the 27th European Conference on IR Research, 2005.
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
 |
20
|
|
| |
21
|
R. Schenkel, A. Broschart, S. Hwang, M. Theobald, and G. Weikum. Efficient text proximity search. In 14th String Processing and Information Retrieval Symposium, 2007.
|
 |
22
|
|
| |
23
|
|
 |
24
|
|
 |
25
|
|
 |
26
|
|
| |
27
|
H. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3):193--201, 1999.
|
| |
28
|
|
 |
29
|
|
 |
30
|
|
| |
31
|
|
|