|
ABSTRACT
In several agent-oriented scenarios in the real world, an autonomous agent that is situated in an unknown environment must learn through a process of trial and error to take actions that result in long-term benefit. Reinforcement Learning (or sequential decision making) is a paradigm well-suited to this requirement. Value function-based methods and policy search methods are contrasting approaches to solve reinforcement learning tasks. While both classes of methods benefit from independent theoretical analyses, these often fail to extend to the practical situations in which the methods are deployed. We conduct an empirical study to examine the strengths and weaknesses of these approaches by introducing a suite of test domains that can be varied for problem size, stochasticity, function approximation, and partial observability. Our results indicate clear patterns in the domain characteristics for which each class of methods excels. We investigate whether their strengths can be combined, and develop an approach to achieve that purpose. The effectiveness of this approach is also demonstrated on the challenging benchmark task of robot soccer Keepaway. We highlight several lines of inquiry that emanate from this study.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319--350, 2001.
|
| |
2
|
R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, June 1957.
|
| |
3
|
S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 105--112. MIT Press, Cambridge, MA, 2008.
|
| |
4
|
|
| |
5
|
R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, November 27--30, 1995, pages 1017--1023. MIT Press, 1996.
|
| |
6
|
P. T. De Boer, D. P. Kroese, S. Mannor, and R. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19--67, 2005.
|
| |
7
|
S. Kakade. A natural policy gradient. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1531--1538. MIT Press, 2001.
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry. Autonomous helicopter flight via reinforcement learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
|
| |
12
|
T. J. Perkins and D. Precup. A convergent form of approximate policy iteration. In S. T. S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1595--1602. MIT Press, Cambridge, MA, 2003.
|
| |
13
|
|
| |
14
|
G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
|
| |
15
|
K. O. Stanley. Efficient evolution of neural networks through complexification. Technical Report AI-TR-04-314, Department of Computer Sciences, University of Texas at Austin, August 2004.
|
| |
16
|
P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior, 13(3):165--188, 2005.
|
| |
17
|
|
| |
18
|
|
 |
19
|
|
 |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
|
|