ABSTRACT
Generalization is a central issue in Machine Learning. In this paper, we present a new idea for improving the generalization ability of Genetic Programming. The idea is based on a dynamic two-layered selection algorithm, and it is tested on a real-life drug discovery regression application. The algorithm begins by using root mean squared error as fitness and standard tournament selection. A list of individuals called "repulsors" is also kept in memory and is initially empty. When an individual is found to overfit the training set, it is inserted into the list of repulsors. When the list of repulsors is not empty, selection becomes a two-layer algorithm: individuals participating in the tournament are not chosen at random from the population but are themselves selected, using the average dissimilarity to the repulsors as a criterion to be maximized. Two similarity/dissimilarity measures are tested for this purpose: the well-known structural (or edit) distance and the recently defined subtree crossover based similarity measure. Although simple, this idea seems to improve the generalization ability of Genetic Programming, and the experimental results presented show that Genetic Programming generalizes better when the subtree crossover based similarity measure is used, at least for the test problems studied in this paper.
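The two-layer selection described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `avg_dissimilarity` and `select`, the parameter `k`, and the use of an inner tournament to pick each participant are assumptions, since the abstract does not specify how the first layer maximizes dissimilarity or how overfitting is detected. The `error` callable stands in for root mean squared error on the training set, and `distance` for either of the two dissimilarity measures.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def avg_dissimilarity(ind: T, repulsors: Sequence[T],
                      distance: Callable[[T, T], float]) -> float:
    """Average distance from one individual to every repulsor."""
    return sum(distance(ind, r) for r in repulsors) / len(repulsors)


def select(population: Sequence[T],
           error: Callable[[T], float],
           repulsors: Sequence[T],
           distance: Callable[[T, T], float],
           k: int = 4,
           rng=random) -> T:
    """Pick one parent; lower error is better.

    With no repulsors this is a plain tournament of size k.
    Otherwise each participant is itself the winner of an inner
    tournament that maximizes average dissimilarity to the repulsors
    (the inner tournament is an assumption of this sketch).
    """
    def pick_participant() -> T:
        if not repulsors:
            # First layer inactive: participants are drawn at random.
            return rng.choice(population)
        candidates = [rng.choice(population) for _ in range(k)]
        return max(candidates,
                   key=lambda c: avg_dissimilarity(c, repulsors, distance))

    # Second layer: the usual fitness-based tournament.
    participants = [pick_participant() for _ in range(k)]
    return min(participants, key=error)
```

In an evolutionary loop, `select` would be called once per parent, and an individual judged to overfit would be appended to `repulsors`, switching on the first layer for later selections.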