ACM Home Page
Please provide us with feedback. Feedback
Better algorithms for benign bandits
Full text PdfPdf (476 KB)
Source Symposium on Discrete Algorithms archive
Proceedings of the Nineteenth Annual ACM -SIAM Symposium on Discrete Algorithms table of contents
New York, New York
Pages 38-47  
Year of Publication: 2009
Authors
Elad Hazan  IBM Almaden, San Jose, CA
Satyen Kale  Microsoft Research, One Microsoft Way, Redmond, WA
Publisher
Society for Industrial and Applied Mathematics  Philadelphia, PA, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 71,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  

ABSTRACT

The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information.

Perhaps the most general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ (√T) regret was discovered in this setting.

In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a regret bound of Õ (√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, 2008.
 
2
3
 
4
 
5
 
6
 
7
V. Dani, T. Hayes, and S. Kakade. The price of bandit information for online optimization. In NIPS, 2008.
8
 
9
 
10
J. Hannan. Approximation to bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97--139, 1957.
 
11
E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In COLT, 2008.
 
12
E. Hazan and S. Kale. How to invest in calmer markets (without compromising universality). Manuscript, 2008.
 
13
 
14
H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In COLT, pages 109--123, 2004.
 
15
A. S. Nemirovskii. Interior point polynomial time methods in convex programming, 2004. Lecture Notes.
 
16
Y. E. Nesterov and A. S. Nemirovskii. Interior Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.
 
17
H. Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527--535, 1952.
18