ABSTRACT
How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters in the clouds could be wiped out if a small cache of a few million URLs could capture much of the value. Language modeling techniques are applied to MSN's search logs to estimate entropy. The perplexity is surprisingly small: millions, not billions. Entropy is a powerful tool for sizing challenges and opportunities. How hard is search? How hard are query suggestion mechanisms like auto-complete? How much does personalization help? All these difficult questions can be answered by estimating entropy from search logs. What is the potential opportunity for personalization? In this paper, we propose a new way to personalize search: personalization with backoff. If we have relevant data for a particular user, we should use it. But if we don't, we back off to larger and larger classes of similar users. As a proof of concept, we use the first few bytes of the IP address to define classes. The coefficients of each backoff class are estimated with an EM algorithm. Ideally, classes would be defined by market segments, demographics and surrogate variables such as time and geography.
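The two ideas in the abstract, sizing a task by its entropy and weighting backoff classes with EM, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the toy URLs, and the toy probabilities are hypothetical, the entropy function is a simple plug-in estimate (it tends to underestimate entropy on sparse data), and the EM routine is the standard procedure for fitting mixture weights on held-out events, assumed here as a stand-in for how the backoff coefficients might be estimated.

```python
import math
from collections import Counter


def entropy_bits(events):
    """Plug-in (maximum-likelihood) entropy, in bits, of a list of observed events."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def perplexity(events):
    """Perplexity is 2 raised to the entropy: the effective number of equally likely choices."""
    return 2 ** entropy_bits(events)


def em_backoff_weights(component_probs, iterations=50):
    """Estimate mixture weights for backoff classes with EM.

    component_probs holds one tuple per held-out event; entry j of a tuple is the
    probability that backoff class j (hypothetically: user, IP prefix, everyone)
    assigned to that event.
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            if mix == 0.0:
                continue  # no class explains this event; it cannot inform the weights
            # E-step: share of this event credited to each class
            for j in range(k):
                expected[j] += weights[j] * probs[j] / mix
        total = sum(expected)
        if total == 0.0:
            break  # nothing to re-estimate from
        # M-step: new weights are the normalized expected counts
        weights = [e / total for e in expected]
    return weights


if __name__ == "__main__":
    # Hypothetical click log: four distinct URLs, each clicked once.
    clicks = ["a.example", "b.example", "c.example", "d.example"]
    print(entropy_bits(clicks), perplexity(clicks))

    # Hypothetical held-out events scored by two backoff classes.
    held_out = [(0.5, 0.1), (0.4, 0.2), (0.0, 0.3)]
    print(em_backoff_weights(held_out))
```

In this sketch a class that predicts held-out events well receives a larger weight, which mirrors the intuition in the abstract: use specific data about a user when it helps, and lean on broader classes when it does not.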
CITED BY 10
Meredith Ringel Morris, Jaime Teevan, Steve Bush, Enhancing collaborative web search with personalization: groupization, smart splitting, and group hit-highlighting, Proceedings of the ACM 2008 conference on Computer supported cooperative work, November 08-12, 2008, San Diego, CA, USA
Doug Downey, Susan Dumais, Dan Liebling, Eric Horvitz, Understanding the relationship between searchers' queries and information goals, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA