ABSTRACT
We consider a graph-theoretic approach to the automatic construction of options in a dynamic environment. The learning agent generates a map of the environment on-line, representing the topological structure of the state transitions. A clustering algorithm then partitions the state space into distinct regions. Policies for reaching the different parts of the state space are learned separately and added to the model in the form of options (macro-actions). The options are used to accelerate the Q-Learning algorithm. We extend the basic algorithm by building a map that includes a preliminary indication of the location of "interesting" regions of the state space, where the value gradient is significant and additional exploration might be beneficial. Experiments indicate significant speedups, especially in the initial learning phase.
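The map-then-cluster step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the two-room grid, the names `step`, `build_map`, `linkage`, `cluster`, and `border_states`, and the average-linkage merging rule, which stands in for whatever clustering algorithm the authors actually use. The transitions are enumerated exhaustively here as a substitute for the experience an exploring agent would collect on-line.

```python
from itertools import combinations

ROWS, COLS = 2, 4
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, move):
    """Hypothetical two-room grid: a wall between columns 1 and 2, open only in row 0."""
    r, c = state
    nr, nc = r + move[0], c + move[1]
    if not (0 <= nr < ROWS and 0 <= nc < COLS):
        return state  # bumped into the outer boundary
    if {c, nc} == {1, 2} and r != 0:
        return state  # bumped into the inner wall
    return (nr, nc)


def build_map(transitions):
    """Undirected transition graph: state -> set of neighbouring states."""
    graph = {}
    for s, s_next in transitions:
        graph.setdefault(s, set())
        graph.setdefault(s_next, set())
        if s != s_next:  # self-transitions carry no topological information
            graph[s].add(s_next)
            graph[s_next].add(s)
    return graph


def linkage(graph, a, b):
    """Number of edges between clusters a and b, divided by the number of possible pairs."""
    cut = sum(1 for s in a for t in graph[s] if t in b)
    return cut / (len(a) * len(b))


def cluster(graph, k):
    """Agglomerative clustering (assumed stand-in): merge the best-linked pair until k remain."""
    clusters = [frozenset([s]) for s in sorted(graph)]
    while len(clusters) > k:
        a, b = max(combinations(clusters, 2), key=lambda p: linkage(graph, *p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


def border_states(graph, regions):
    """States with a neighbour in another region: candidate targets for options."""
    label = {s: i for i, region in enumerate(regions) for s in region}
    return sorted(s for s in graph if any(label[t] != label[s] for t in graph[s]))


states = [(r, c) for r in range(ROWS) for c in range(COLS)]
transitions = [(s, step(s, m)) for s in states for m in MOVES]
graph = build_map(transitions)
regions = cluster(graph, k=2)
borders = border_states(graph, regions)
print(regions)
print(borders)
```

In this toy setting the two recovered regions coincide with the two rooms, and the border states are the cells on either side of the doorway. A policy for reaching such a border state from anywhere inside its region is one plausible way to instantiate an option, though the paper's own construction may differ.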