|
ABSTRACT
How should a decision-maker perform repeated choices so as to optimize the average cost or benefit of those choices in the long run? This question motivates the theory of online learning, which encompasses problems such as the well-known best-expert [13, 9] and multi-armed bandit [10, 1] problems. This paper concerns a new approach to dealing with multi-armed bandit problems in which the decision-maker's strategy set is large (exponential or possibly infinite). Recent theoretical progress on the analysis of algorithms for such problems (e.g. [2, 3, 8, 11, 14]) has led to improved online algorithms for problems in areas such as online routing [2], dynamic pricing mechanisms [4, 5, 12], and analysis of reputation systems in e-commerce and peer-to-peer networks [3].
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
B. Awerbuch and R. Kleinberg, Competitive collaborative learning, in Proceedings of the 18th Annual Conference on Learning Theory (COLT), 2005. To appear.
|
| |
4
|
|
| |
5
|
|
| |
6
|
M. Boddy, Anytime problem solving using dynamic programming, Proceedings of National Conference on Artificial Intelligence (AAAI), (1991).
|
| |
7
|
E. Cope, Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces, 2004. Unpublished manuscript.
|
| |
8
|
|
| |
9
|
|
| |
10
|
J. C. Gittins and D. M. Jones, A dynamic allocation index for the sequential design of experiments, in Progress in Statistics, J. G. et al., ed., North-Holland, 1974, pp. 241--266.
|
| |
11
|
R. Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, eds., MIT Press, Cambridge, MA, 2005, pp. 697--704.
|
| |
12
|
|
| |
13
|
|
| |
14
|
H. B. McMahan and A. Blum, Online geometric optimization in the bandit setting against an adaptive adversary, in Proceedings of the 17th Annual Conference on Learning Theory (COLT), vol. 3120 of Lecture Notes in Computer Science, Springer Verlag, 2004, pp. 109--123.
|
| |
15
|
S. Zilberstein, Operational rationality through compilation of anytime algorithms, 1993.
|
CITED BY 4
|
|
Avrim Blum , MohammadTaghi Hajiaghayi , Katrina Ligett , Aaron Roth, Regret minimization and the price of total anarchy, Proceedings of the 40th annual ACM symposium on Theory of computing, May 17-20, 2008, Victoria, British Columbia, Canada
|
|
|
|
|
|
|
|
|
|
|