ACM Home Page
Please provide us with feedback. Feedback
Entropy of search logs: how hard is search? with personalization? with backoff?
Full text PdfPdf (514 KB)
Source
Web Search and Web Data Mining archive
Proceedings of the international conference on Web search and web data mining table of contents
Palo Alto, California, USA
SESSION: Indexing and search table of contents
Pages 45-54  
Year of Publication: 2008
ISBN:978-1-59593-927-9
Authors
Qiaozhu Mei  University of Illinois at Urbana Champaign, Urbana, IL
Kenneth Church  Microsoft Research, Redmond, WA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 190,   Citation Count: 9
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1341531.1341540
What is a DOI?

ABSTRACT

How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters in the clouds could be wiped out if a small cache of a few million urls could capture much of the value. Language modeling techniques are applied to MSN's search logs to estimate entropy. The perplexity is surprisingly small: millions, not billions.

Entropy is a powerful tool for sizing challenges and opportunities. How hard is search? How hard are query suggestion mechanisms like auto-complete? How much does personalization help? All these difficult questions can be answered by estimation of entropy from search logs.

What is the potential opportunity for personalization? In this paper, we propose a new way to personalize search, personalization with backoff. If we have relevant data for a particular user, we should use it. But if we don't, back off to larger and larger classes of similar users. As a proof of concept, we use the first few bytes of the IP address to define classes. The coefficients of each backoff class are estimated with an EM algorithm. Ideally, classes would be defined by market segments, demographics and surrogate variables such as time and geography


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
 
4
N. Chomsky. Syntactic Structures. The Hague/Paris: Mouton, 1957.
 
5
K. Church and W. Gale. A comparison of the enhanced good-turing and deleted estimation methods for estimating probabilities of english bigrams. Computer Speech and Language, 5(1):19--54, 1991.
 
6
 
7
 
8
 
9
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of Royal Statist. Soc. B, 39:1--38, 1977.
10
 
11
R. Gallager. Claude E. Shannon: A retrospective on his life, work, and impact. IEEE Transactions on Information Theory, 47(7):2681--2695, 2001.
12
13
14
15
16
 
17
S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speeech and Signal Processing, 35(3):400--401, 1987.
 
18
S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98--100, 1998.
 
19
S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400(6740):107--109, 1999.
20
21
 
22
C. Sagan. Billions and Billions: Thoughts on Life and Death at the Brink of the Millennium. New York: Random House, 1997.
 
23
C. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379--423, 623¿656, 1948.
 
24
C. Shannon. Prediction and entropy of printed english. Bell Systems Technical Journal, 30:50--64, 1951.
25
26
27
28
29
 
30
31

CITED BY  10

Collaborative Colleagues:
Qiaozhu Mei: colleagues
Kenneth Church: colleagues