| Exploration scavenging |
| Full text |
Pdf
(267 KB)
|
| Source
|
ICML; Vol. 307
archive
Proceedings of the 25th international conference on Machine learning
table of contents
Helsinki, Finland
Pages 528-535
Year of Publication: 2008
ISBN:978-1-60558-205-4
|
|
Authors
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 4, Downloads (12 Months): 31, Citation Count: 0
|
|
|
ABSTRACT
We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose and prove the correctness of a principled method for policy evaluation which works when this is not the case, even when the exploration policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independent of side information, an assumption that is typically violated by commercial systems, we show how clever uses of the theory provide non-trivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Alon, N., & Spencer, J. (2000). The probabilistic method. Interscience Series in Discrete Mathematics and Optimization. John Wiley. Second edition.
|
| |
2
|
|
| |
3
|
|
| |
4
|
Berry, D. A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments. London, UK: Chapman and Hall.
|
 |
5
|
Christian Borgs , Jennifer Chayes , Nicole Immorlica , Kamal Jain , Omid Etesami , Mohammad Mahdian, Dynamics of bid optimization in online advertisement auctions, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
[doi> 10.1145/1242572.1242644]
|
| |
6
|
Dupret, G., Murdock, V., & Piwowarski, B. (2007). Web search engine evaluation using clickthrough data and a user model. 16th Intl. World Wide Web Conference.
|
| |
7
|
|
| |
8
|
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153--161.
|
 |
9
|
|
| |
10
|
|
 |
11
|
|
| |
12
|
|
| |
13
|
Langford, J., & Zhang, T. (2007). The epoch greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems.
|
| |
14
|
Wang, C.-C., Kulkarni, S. R., & Poor, H. V. (2005). Bandit problems with side observations. IEEE Transactions on Automatic Control, 50, 338--355.
|
|