ACM Home Page
Please provide us with feedback. Feedback
Combining expert advice in reactive environments
Full text PdfPdf (247 KB)
Source Journal of the ACM (JACM) archive
Volume 53 ,  Issue 5  (September 2006) table of contents
Pages: 762 - 799  
Year of Publication: 2006
ISSN:0004-5411
Authors
Daniela Pucci De Farias  Massachusetts Institute of Technology, Cambridge, Massachusetts
Nimrod Megiddo  IBM Almaden Research Center, San Jose, California
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 14,   Downloads (12 Months): 82,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1183907.1183911
What is a DOI?

ABSTRACT

“Experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game.A general exploration-exploitation experts method is presented along with a proper definition of value. The definition is shown to be adequate in that it both captures the impact of an expert's actions on the environment and is learnable. The new experts method is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player's actions on the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner's Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
 
3
Chernoff, H. 1952. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 23, 493--507.
 
4
de Farias, D., and Megiddo, N. 2004. How to combine expert (or novice) advice when actions impact the environment. In Advances in Neural Information Processing Systems, Vol. 16.
 
5
Feller, W. 1971. Probability Theory and its Applications. Wiley, New York.
 
6
 
7
Foster, D. and Vohra, R. 1999. Regret and the on-line decision problem. Games Econ. Behav. 29, 7--35.
 
8
 
9
Freund, Y., and Schapire, R. E. 1999. Adaptive game playing using multiplicative weights. Games Econ. Behav. 29, 79--103.
 
10
Fudenberg, D., and Levine, D. 1997. The Theory of Learning in Games. The MIT Press, Cambridge, MA.
 
11
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. ASA 58, 13--30.
 
12
Kakade, S. 2003. On the sample complexity of reinforcement learning. Ph.D. dissertation, Gatsby Computational Neuroscience Unit, University College, London, England.
 
13
 
14
 
15
Lai, T.-L., and Yakowitz, S. 1995. Machine learning and nonparametric bandit theory. IEEE Trans. Automat. Cont. 40, 7, 1199--1209.
 
16
 
17
 
18
 
19
Williams, D. 1991. Probability with Martingales. Cambridge University Press, Cambridge, UK.


Collaborative Colleagues:
Daniela Pucci De Farias: colleagues
Nimrod Megiddo: colleagues