ACM Home Page
Please provide us with feedback. Feedback
Compressed web indexes
Full text PdfPdf (721 KB)
Source
International World Wide Web Conference archive
Proceedings of the 18th international conference on World wide web table of contents
Madrid, Spain
SESSION: Search/session: caching and indices table of contents
Pages 451-460  
Year of Publication: 2009
ISBN:978-1-60558-487-4
Authors
Flavio Chierichetti  Sapienza University of Rome, Rome, Italy
Ravi Kumar  Yahoo! Research, Sunnyvale, CA, USA
Prabhakar Raghavan  Yahoo! Research, Sunnyvale, CA, USA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 109,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1526709.1526770
What is a DOI?

ABSTRACT

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
T. M. Apostol. Introduction to Analytic Number Theory. Springer-Verlag, 1976.
 
2
 
3
 
4
 
5
6
 
7
 
8
 
9
L. Q. Ha, E. I. Sicilia-Garcia, J. Ming, and F. J. Smith. Extension of Zipf's law to word and character n-grams for English and Chinese. Computational Linguistics and Chinese Language Processing, 8(1):77--102, 2003.
 
10
 
11
 
12
W. Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6):1842--1845, 1992.
 
13
B. Mandelbrot. An information theory of the statistical structure of language. In W. Jackson, editor, Communication Theory, pages 486--502. Academic Press, 1953.
 
14
 
15
 
16
M. Mitzenmacher. Dynamic models for file sizes and double Pareto distributions. Internet Mathematics, 1(3):305--333, 2003.
 
17
M. Molloy and B. Reed. Graph Coloring and the Probabilistic Method. Springer-Verlag, 2002.
 
18
 
19
W. J. Reed and M. Jorgensen. The double Pareto-lognormal distribution - A new parametric model for size distributions. Communications in Statistics: Theory and Methods, 33(8):1733--1753, 2004.
 
20
21
 
22
H. A. Simon. On a class of skew distribution functions. Biometrika, 42:425--440, 1955.
 
23
 
24
D. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton, 2003.
 
25
H. E. Williams and J. Zobel. Searchable words on the web. International Journal on Digital Libraries, 5(2):99--105, 2005.
 
26
 
27
 
28
G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge MA, 1949

Collaborative Colleagues:
Flavio Chierichetti: colleagues
Ravi Kumar: colleagues
Prabhakar Raghavan: colleagues