ABSTRACT
We are living in the petabyte era. We have larger and larger data sets to analyze, process, and transform into useful answers for domain experts. Robust data mining tools that can cope with petascale volumes and/or high dimensionality while producing human-understandable solutions are key in several domain areas. Genetics-based machine learning (GBML) techniques are perfect candidates for this task, among others, due to recent advances in representations, learning paradigms, and theoretical modeling. If evolutionary learning techniques aspire to be a relevant player in this context, they need the capacity to process these vast amounts of data, and they need to do so within reasonable time. Moreover, massive computation cycles are getting cheaper every day, giving researchers access to unprecedented degrees of parallelization. Several topics are interlaced in these two requirements, among them: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts. This tutorial will try to address these topics, following a roadmap that starts with the questions of what "large" means and why large is a challenge for data mining methods. Afterwards, we will discuss different facets in which we can overcome this challenge: efficiency enhancement techniques, representations able to cope with large dimensionality spaces, scalability of learning paradigms, hardware solutions, parallel models, and data-intensive computing. The roadmap continues with examples of real applications of GBML systems and finishes with an analysis of further directions.