|
ABSTRACT
We provide a uniform framework for learning against a recent history adversary in arbitrary repeated bimatrix games, by modeling such an agent as a Markov Decision Process. We focus on learning an optimal non-stationary policy in such an MDP over a finite horizon and adapt an existing efficient Monte Carlo based algorithm for learning optimal policies in such MDPs. We show that this new efficient algorithm can obtain higher average rewards than a previously known efficient algorithm against some opponents in the contract game. Though this improvement comes at the cost of increased domain knowledge, a simple experiment in the Prisoner's Dilemma game shows that even when no extra domain knowledge (besides that the opponent's memory size is known) is assumed, the error can still be small.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. Axelrod. The Evolution of Cooperation. Basic Books, 1984.
|
| |
2
|
J. A. Bagnell, S. Kakade, A. Y. Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems 16. MIT Press, 2003.
|
| |
3
|
B. Banerjee and J. Peng. Performance bounded reinforcement learning in strategic interactions. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-04), pages 2--7, San Jose, CA, 2004. AAAI Press.
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
D. Carmel and S. Markovitch. Learning models of intelligent agents. In Thirteenth National Conference on Artificial Intelligence, pages 62--67, Menlo Park, CA, 1996. AAAI Press/MIT Press.
|
| |
8
|
|
| |
9
|
D. Carmel and S. Markovitch. Model-based learning of interaction strategies in multiagent systems. Journal of Experimental and Theoretical Artificial Intelligence, 10(3):309 -- 332, 1998.
|
| |
10
|
V. Conitzer and T. Sandholm. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. In Proceedings of the 20th International Conference on Machine Learning, 2003.
|
 |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
Y. Freund , M. Kearns , Y. Mansour , D. Ron , R. Rubinfeld , R. E. Schapire, Efficient algorithms for learning to play repeated games against computationally bounded adversaries, Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS'95), p.332, October 23-25, 1995
|
| |
15
|
S. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
|
| |
16
|
M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large pomdps via reusable trajectories. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
G. Tesauro. Extending Q-learning to general adaptive multi-agent systems. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
|
|