ACM Home Page
Please provide us with feedback. Feedback
Exploration scavenging
Full text PdfPdf (267 KB)
Source ICML; Vol. 307 archive
Proceedings of the 25th international conference on Machine learning table of contents
Helsinki, Finland
Pages 528-535  
Year of Publication: 2008
ISBN:978-1-60558-205-4
Authors
John Langford  Yahoo! Research, New York, New York
Alexander Strehl  Yahoo! Research, New York, New York
Jennifer Wortman  University of Pennsylvania, Philadelphia, PA
Sponsors
: Yahoo!
: Xerox
IBM : IBM
: NSF
Microsoft Research : Microsoft Research
: Machine Learning Journal/Springer
: Pascal
: University of Helsinki
: Federation of Finnish Learned Societies
: Intel Corporation
: Google
: Helsinki Institute for Information Technology
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 31,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1390156.1390223
What is a DOI?

ABSTRACT

We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose and prove the correctness of a principled method for policy evaluation which works when this is not the case, even when the exploration policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independent of side information, an assumption that is typically violated by commercial systems, we show how clever uses of the theory provide non-trivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Alon, N., & Spencer, J. (2000). The probabilistic method. Interscience Series in Discrete Mathematics and Optimization. John Wiley. Second edition.
 
2
 
3
 
4
Berry, D. A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments. London, UK: Chapman and Hall.
5
 
6
Dupret, G., Murdock, V., & Piwowarski, B. (2007). Web search engine evaluation using clickthrough data and a user model. 16th Intl. World Wide Web Conference.
 
7
 
8
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153--161.
9
 
10
11
 
12
 
13
Langford, J., & Zhang, T. (2007). The epoch greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems.
 
14
Wang, C.-C., Kulkarni, S. R., & Poor, H. V. (2005). Bandit problems with side observations. IEEE Transactions on Automatic Control, 50, 338--355.

Collaborative Colleagues:
John Langford: colleagues
Alexander Strehl: colleagues
Jennifer Wortman: colleagues