ABSTRACT
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them "curriculum learning". In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
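As a rough illustration of the idea of presenting examples in a meaningful order, the sketch below sorts a training set by a difficulty score and yields progressively larger subsets, easiest examples first. It is only a minimal sketch under stated assumptions, not the procedure used in the experiments: the function name `curriculum_stages`, the `difficulty` scoring function, and the fixed number of equal-sized stages are all hypothetical choices made for illustration.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    examples: Sequence[T],
    difficulty: Callable[[T], float],
    n_stages: int,
) -> Iterator[List[T]]:
    """Yield growing training sets, easiest examples first.

    Hypothetical helper: `difficulty` is an assumed user-supplied score
    (lower means easier); the abstract does not prescribe one.
    """
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    # Order the whole training set from easiest to hardest.
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, n_stages + 1):
        # Integer ceiling, so the final stage always covers every example.
        cutoff = (len(ordered) * stage + n_stages - 1) // n_stages
        yield ordered[:cutoff]


if __name__ == "__main__":
    # Toy data: treat longer strings as "harder" examples.
    data = ["aaaa", "a", "aaa", "aa", "aaaaa"]
    for i, subset in enumerate(curriculum_stages(data, len, n_stages=2), 1):
        print(f"stage {i}: train on {subset}")
```

A learner would be trained on each yielded subset in turn, so that early updates are driven by the easier examples and the full training distribution is only reached at the last stage.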