|
ABSTRACT
In this paper we propose a new formal model for studying reinforcement learning, based on Valiant's PAC framework.
In our model the learner does not have direct access to every state of the environment. Instead, every sequence of experiments starts in a fixed initial state and the learner is provided with a “reset” operation that interrupts the current sequence of experiments and starts a new one (from the initial state).
We do not require the agent to learn the optimal policy but only a good approximation of it with high probability. More precisely, we require the learner to produce a policy whose expected value from the initial state is &egr;-close to that of the optimal policy, with probability no less than 1−&dgr;.
For this model, we describe an algorithm that produces such an (&egr;,&dgr;)-optimal policy for any environment, in time polynomial in N,K,1/&egr;,1/&dgr;,1/(1−&bgr;) and rmax, where N is the number of states of the environment, K is the maximum number of actions in a state, &bgr; is the discount factor and rmax is the maximum reward on any transition.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A.G. Barto, R.S. Sutton, and C.W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13:834-846, 1983.
|
| |
2
|
A.G. Barto, R.S. Sutton, and C. Watkins. Learning and sequential decision making. In M. Gabriel and J.W. Moore, editors, Learning and Computational Ncuroacie~cc. MIT Press, Cambridge, MA, 1990.
|
| |
3
|
|
| |
4
|
J.J. Grefenstette. Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, 5:355-382, 1989.
|
| |
5
|
|
| |
6
|
|
| |
7
|
|
| |
8
|
Wassily Hoeffding. Probability inequalities for sum of bounded random variables. Journal of the Amer. ican Statistical Association, 58(301), 1963.
|
| |
9
|
J.H. Holland. Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R.S. Michalski, }.G. Carbonell, and T.M. Mitchell, editors, Machine Learning: An Artfcial Intelligence Approach (Volume II). Morgan Kaufmann, San Mateo, CA, 1986.
|
| |
10
|
S. Koenig and R.G. Simmons. Complexity analysis of real-time reinforcement learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 99-105. Morgan Kaufmann, 1993.
|
| |
11
|
H. Mine and S. Osaki. Markovian Decision Processes. Modern Analytic and Computational Methods in Science and Mathematics. Elsevier, New York, NY, 1970.
|
| |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
|
 |
16
|
|
| |
17
|
C. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, MA, 1989.
|
| |
18
|
|
CITED BY 13
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Alexander L. Strehl , Lihong Li , Eric Wiewiora , John Langford , Michael L. Littman, PAC model-free reinforcement learning, Proceedings of the 23rd international conference on Machine learning, p.881-888, June 25-29, 2006, Pittsburgh, Pennsylvania
|
|
|
Bethany R. Leffler , Michael L. Littman , Timothy Edmunds, Efficient reinforcement learning with relocatable action models, Proceedings of the 22nd national conference on Artificial intelligence, p.572-577, July 22-26, 2007, Vancouver, British Columbia, Canada
|
|
|
|
REVIEW
"Jaak Tepandi : Reviewer"
A new formal model for studying reinforcement learning is
presented. In reinforcement learning, an agent learns a task that
involves a sequence of actions, by interacting with the environment. The
agent learns to associate appropriate actions
more...
|