| Expediting RL by using graphical structures |
| Full text |
Pdf
(376 KB)
|
Source
|
International Conference on Autonomous Agents
archive
Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems - Volume 3
table of contents
Estoril, Portugal
SESSION: Agent and multi-agent learning
table of contents
Pages 1325-1328
Year of Publication: 2008
ISBN:978-0-9817381-2-X
|
|
Authors
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 3, Downloads (12 Months): 39, Citation Count: 0
|
|
|
ABSTRACT
The goal of Reinforcement learning (RL) is to maximize reward (minimize cost) in a Markov decision process (MDP) without knowing the underlying model a priori. RL algorithms tend to be much slower than planning algorithms, which require the model as input. Recent results demonstrate that MDP planning can be expedited, by exploiting the graphical structure of the MDP. We present extensions to two popular RL algorithms, Q-learning and RMax, that learn and exploit the graphical structure of problems to improve overall learning speed. Use of the graphical structure of the underlying MDP can greatly improve the speed of planning algorithms, if the underlying MDP has a nontrivial topological structure. Our experiments show that use of the apparent topological structure of an MDP speeds up reinforcement learning, even if the MDP is simply connected.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. J. of Artificial Intelligence Research, 11:1--94, 1999.
|
| |
3
|
|
| |
4
|
P. Dai and J. Goldsmith. Topological value iteration algorithm for Markov decision processes. In Proc. IJCAI-07, pages 1860--1865, 2007.
|
| |
5
|
|
| |
6
|
S. M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
|
| |
7
|
|
| |
8
|
|
| |
9
|
R. Munos and A. Moore. Influence and variance of a Markov chain: Application to adaptive discretization in optimal control. In Proc. of IEEE Conference on Decision and Control, 1999.
|
 |
10
|
|
 |
11
|
|
| |
12
|
|
| |
13
|
C. J. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989.
|
| |
14
|
|
| |
15
|
|
|