ABSTRACT
The Cornell Laboratory of Ornithology's mission is to interpret and conserve the earth's biological diversity through research, education, and citizen science focused on birds. Over the years, the Lab has accumulated one of the largest and longest-running collections of environmental data sets in existence. The data sets are not only large but also have many attributes, contain many missing values, and are potentially very noisy. The ecologists are interested in identifying which features have the strongest effect on the distribution and abundance of bird species, as well as in describing the forms of these relationships. We show how data mining can be applied successfully to these data, enabling the ecologists to discover unanticipated relationships. We compare a variety of methods for measuring attribute importance with respect to the probability of a bird being observed at a feeder, and we present initial results for the impact of important attributes on bird prevalence.